Processing Large Datasets for ADAS applications using Spark
(Transcript)
© 2020 The MathWorks, Inc.
Arvind Hosagrahara ([email protected])
Lead Solutions Architect
May 2021
Outline
§ The Big Picture – Problem Statement / Challenges
§ Architecture of the system (ETL, Training, Inference)
§ Deep-Dive into Spark Workflows
– Cluster management
– Interactive exploration
– Pushdown of analytics to Spark clusters
– Experiment and metadata management
§ Design considerations
– Scaling, accuracy, and interoperability
– Security, governance, model lifecycle management
§ Conclusions / Key Takeaways
§ Q&A
The Big Picture – Automated Driving
§ How can I analyze & simulate scenarios?
§ How can I design & deploy algorithms?
§ How can I integrate & test systems?
(Diagram: Perception, Planning, and Control layers of an automated driving stack)
Semantic Segmentation
What?
§ Semantic segmentation is the classification of every pixel in an image/video
Why?
§ Higher-quality perception outcomes; easier analysis and validation of localization modules for automated driving (AD)
§ Automation: faster design iterations
“…autonomous vehicles would have to be driven hundreds of millions of miles and sometimes hundreds of billions of miles to demonstrate their reliability in terms of fatalities and injuries.” [1]
[1] Driving to Safety, RAND Corporation
https://www.rand.org/content/dam/rand/pubs/research_reports/RR1400/RR1478/RAND_RR1478.pdf
Problem Statement
How can we:
1. Accelerate development of semantic segmentation models from prototype to production in order to build better perception software?
2. Scale the workflows, capabilities, and throughput of engineering processes by leveraging automation, cloud storage, compute, and other infrastructure?
Challenges
Business Challenges
§ Prototype to production: establish system & software engineering processes
§ Verification and validation: connect data, embedded software, and simulation
§ Workforce development: hire & integrate engineers with disparate backgrounds
Technical Challenges
§ Handling large datasets and specialized formats
§ Leveraging the cloud to scale computational requirements
§ Enabling self-serve analytics for the data scientist / engineer
Demo / Results
§ Use 3-D simulation data to train a semantic segmentation network and fine-tune it to real-world data using generative adversarial networks (GANs)
Architecture of the demo system
(Diagram: end-to-end system spanning data ingest, storage, development, training, and inference, with experiment management / governance / security as a cross-cutting layer)
A simplified view of the system architecture
§ Recorded and simulated data (batch and streaming ingest)
§ Storage (large datasets / models / metadata)
§ Development environment (desktop / cloud)
§ Cloud-based training system (Spark clusters / GPU support / fast networks), with burst capacity
§ Model inference engine for ETL and performance monitoring
§ Monitoring / visualization front-ends
§ Cross-cutting: data access, experiment management / governance / security
The ETL (Extract Transform Load) Pipeline
§ Recorded and simulated data (batch and streaming ingest) is ingested into storage (large datasets / models / metadata)
§ The model inference engine performs ETL and performance monitoring
§ Experiment management / governance / security apply across the pipeline
The Training Pipeline
§ The development environment (desktop / cloud) accesses data in storage (large datasets / models / metadata)
§ A cloud-based training system (Spark clusters / GPU support / fast networks) provides burst capacity
§ Experiment management / governance / security apply across the pipeline
The Inference Pipeline
§ The model inference engine for ETL and performance monitoring accesses data in storage (large datasets / models / metadata)
§ Results flow to monitoring / visualization front-ends
§ Experiment management / governance / security apply across the pipeline
The Software Stack
§ Development environment: MATLAB® and Simulink®
§ Inference engine: MATLAB Production Server™
§ Front-ends: MATLAB Web App Server™ and third-party dashboards
§ Storage: Amazon S3™
§ Identity and security: AWS Identity and Access Management (IAM)
§ Cloud infrastructure: AWS™
§ Experiment management: MLflow™
Workflows supporting DevOps
§ DevOps loop across the development and operational environments: PLAN → DESIGN → BUILD → TEST → DEPLOY → OPERATE → MONITOR → FEEDBACK
§ Cluster management
§ JDBC-based connectivity to data sources
§ Databricks-connect and interactive exploration
§ CI/CD integration
§ Jobs and spark-submit
§ Model lifecycle management
§ Experiment management
§ Coder-based acceleration
§ Embedded applications
§ Inference engine
§ Central result, config, and data management
Workflow: Cluster management and Data Engineering via REST
§ Users: engineers, developers, scientists, data analysts
§ MATLAB integrates with the Databricks REST 2.0 API through an HTTP client (e.g. jobsubmission.m):
– Create clusters
– CRUD with DBFS
– Create jobs
– Authentication / tokens
– Secrets
– Notifications
– Libraries
– Workspace
§ The compute cluster (cloud/on-prem) runs a Databricks-specific runtime image: the MATLAB Runtime installed via init_script / Dockerfile (dockerized runtime), executing the compiled analytics.jar
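To make the REST workflow concrete, the sketch below builds (but does not send) a Databricks REST 2.0 `clusters/create` request in Python. The host, token, cluster settings, and the init-script path that installs the MATLAB Runtime are illustrative placeholders, not values from the presentation.

```python
import json

def make_create_cluster_request(host, token, cluster_name, spark_version,
                                node_type_id, num_workers,
                                init_script_path=None):
    """Build a Databricks REST 2.0 clusters/create request.

    The endpoint and payload shape follow the public Databricks REST 2.0
    API; the specific cluster settings here are placeholders.
    """
    payload = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }
    if init_script_path:
        # A cluster-scoped init script (e.g. one that installs the
        # MATLAB Runtime, as on this slide) can be attached here.
        payload["init_scripts"] = [{"dbfs": {"destination": init_script_path}}]
    return {
        "url": host.rstrip("/") + "/api/2.0/clusters/create",
        "headers": {"Authorization": "Bearer " + token,
                    "Content-Type": "application/json"},
        "body": json.dumps(payload),
    }

req = make_create_cluster_request(
    "https://example.cloud.databricks.com", "<personal-access-token>",
    "matlab-runtime-cluster", "7.3.x-scala2.12", "i3.xlarge", 4,
    init_script_path="dbfs:/init/install_matlab_runtime.sh")
print(req["url"])
```

From MATLAB, the same request can be issued with the HTTP client interfaces, which is what a script like jobsubmission.m would wrap.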
Video 1 Outline
§ Planning capacity and performance
§ Cluster creation
§ Data and application transfer via the DBFS API
Workflow: JDBC and Database toolbox (a)
§ Users: engineers, developers, scientists, data analysts
§ Database Toolbox connects through an ODBC 2.6.x / JDBC 2.6.x driver
§ SQL queries run against an existing compute cluster (cloud/on-prem)
Workflow: JDBC and Database toolbox (b)
§ Users: engineers, developers, scientists, data analysts
§ An ODBC 2.6.x / JDBC 2.6.x driver connects to a SparkSession on an existing compute cluster (cloud/on-prem)
§ SQL queries run against databases / Delta tables
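As a rough illustration of what the driver connection looks like, the sketch below assembles a JDBC connection string of the shape used by the Simba Spark driver that Databricks distributes. The property names (transportMode, httpPath, AuthMech, ...) and all values are assumptions to verify against the documentation of the driver version actually installed.

```python
def make_spark_jdbc_url(host, http_path, token):
    """Assemble a Databricks/Spark SQL JDBC connection string.

    The property names follow Simba Spark driver conventions but vary by
    driver version; treat this as an illustrative template only.
    """
    return ("jdbc:spark://{host}:443/default;transportMode=http;ssl=1;"
            "httpPath={path};AuthMech=3;UID=token;PWD={token}").format(
                host=host, path=http_path, token=token)

url = make_spark_jdbc_url("example.cloud.databricks.com",
                          "sql/protocolv1/o/0/0123-456789-abc123",
                          "<personal-access-token>")
```

Database Toolbox would consume such a URL (together with the driver JAR) to open the connection, after which ordinary SQL runs against databases and Delta tables.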
Workflow: Databricks-connect
§ Users: engineers, developers, scientists, data analysts
§ Databricks-connect submits work through the Spark API to the Spark master of an existing compute cluster (cloud/on-prem), which executes the Spark DAG
The Training Pipeline (Local Experimentation)
§ Optimized for GPU
§ High-level API (“write less and do more”)
§ Language abstractions to improve accuracy (e.g. imageDataAugmenter)
§ Workflows for ease of use and debugging
– Apps for model selection
– Monitoring of accuracy, loss, and training performance
– Many quick-start examples
Workflow: Push-down via MATLAB Compiler
§ Users: engineers, developers, scientists, data analysts
§ analytics.m is packaged by MATLAB Compiler/SDK into analytics.jar
§ Deployment via spark-submit jobs for new clusters, or as libraries for existing clusters
§ The MATLAB API for Spark and tall arrays execute as a Spark DAG under the Spark master on the compute cluster (cloud/on-prem)
Video 3 Outline
§ SparkSession and RDD abstractions
§ Tall-based workflows
§ deploytool and app-based workflows
– Compiler SDK for Java
§ mcc and spark-submit
§ DBFS API to push the application to Databricks
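The DBFS step in the list above maps onto a single REST call for small files. This Python sketch builds a DBFS 2.0 `put` request for an artifact such as the compiled analytics.jar; the host, token, and path are placeholders, and files above the API's single-call size limit (about 1 MB) require the streaming create/add-block/close calls instead.

```python
import base64
import json

def make_dbfs_put_request(host, token, dbfs_path, data, overwrite=True):
    """Build a Databricks DBFS REST 2.0 'put' request for a small file.

    Suitable for pushing a compiled artifact (e.g. analytics.jar) to
    DBFS; larger uploads use the streaming create/add-block/close calls.
    """
    payload = {
        "path": dbfs_path,
        # File contents travel base64-encoded inside the JSON body.
        "contents": base64.b64encode(data).decode("ascii"),
        "overwrite": overwrite,
    }
    return {
        "url": host.rstrip("/") + "/api/2.0/dbfs/put",
        "headers": {"Authorization": "Bearer " + token},
        "body": json.dumps(payload),
    }

req = make_dbfs_put_request("https://example.cloud.databricks.com",
                            "<personal-access-token>",
                            "dbfs:/apps/analytics.jar", b"\xca\xfe\xba\xbe")
```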
Build and Test
§ Improve quality of code and models by testing on CI servers (self-hosted or cloud-hosted)
§ Package libraries for deployment
§ Adhere to agile development practices
https://www.mathworks.com/solutions/continuous-integration.html
The Training Pipeline (Cloud-based training)
§ Infrastructure defined as code
§ Leverage the latest hardware, optimized drivers, and toolkits
§ APIs for command and control of jobs, artifacts, notifications, execution, etc.
§ Capital expenses (CapEx) → operating expenses (OpEx)
Video 4 Outline
§ MLflow demo
– Creation of experiments / runs
– Logging of parameters and metrics
– Integration with experiment management features
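The parameter and metric logging the video demonstrates maps onto the MLflow REST API. The sketch below builds the `runs/log-parameter` and `runs/log-metric` requests; the tracking-server host, token, and run ID are placeholders.

```python
import json
import time

def mlflow_log_request(host, token, kind, run_id, key, value):
    """Build an MLflow REST API request to log a parameter or a metric.

    'kind' selects the runs/log-parameter or runs/log-metric endpoint;
    the payload shapes follow the MLflow 2.0 REST API.
    """
    payload = {"run_id": run_id, "key": key}
    if kind == "metric":
        endpoint = "/api/2.0/mlflow/runs/log-metric"
        payload.update(value=float(value),
                       timestamp=int(time.time() * 1000), step=0)
    elif kind == "param":
        endpoint = "/api/2.0/mlflow/runs/log-parameter"
        payload["value"] = str(value)  # parameters are logged as strings
    else:
        raise ValueError("kind must be 'metric' or 'param'")
    return {"url": host.rstrip("/") + endpoint,
            "headers": {"Authorization": "Bearer " + token},
            "body": json.dumps(payload)}

req = mlflow_log_request("https://mlflow.example.com", "<token>",
                         "metric", "run-0123", "val_accuracy", 0.91)
```

From MATLAB, the same endpoints can be driven through the HTTP client interfaces, which is one way to wire training runs into MLflow experiment tracking.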
The Inference Pipeline
§ Directed acyclic graphs (DAGs) are used to resolve changes to:
– Training models and ETL logic
– The training data
– New data / changes to the cache
§ Trained models can be deployed as cloud services
§ Cloud reference architectures are available on github.com to integrate with existing systems
Scaling Inference on the MATLAB Production Server
§ Scales gracefully
§ Optimal usage of available CPU and GPU resources
§ Can increase throughput as desired
§ Pay-as-you-go (PAYG) models available on the cloud
The ETL Pipeline
§ Wide variety of automotive and engineering formats (e.g. ROSBag, MP4, MDF, BLF, PCAP)
§ RESTful endpoints or client-based access
§ Connectivity to cloud storage (S3) and distributed streaming systems (Kafka)
§ Horizontally scalable and designed to run 24x7x365 on headless cloud systems
§ Provides data proximity when executing on the cloud
§ Accelerated CUDA execution on NVIDIA® GPUs via GPU Coder™
Build Graph (DAG)
§ Tracks engine code, trained models, and new/cached data
§ Operates on S3 version metadata
§ Can be triggered manually or automatically (e.g. via CI/CD pipelines)
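The rebuild logic such a graph implies can be sketched with a topological sort: given the artifacts that changed (which the real system detects from S3 version metadata), walk the DAG in dependency order and mark everything downstream for rebuild. The node names below are illustrative, not the system's actual artifact names.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def nodes_to_rebuild(deps, changed):
    """Return, in rebuild order, every node whose inputs (transitively)
    include a changed artifact. deps maps node -> set of its inputs."""
    order = TopologicalSorter(deps).static_order()
    dirty = set(changed)
    rebuild = []
    for node in order:
        # A node is dirty if any of its direct inputs is dirty; walking
        # in topological order makes this propagate transitively.
        if deps.get(node, set()) & dirty:
            dirty.add(node)
            rebuild.append(node)
    return rebuild

# Illustrative build graph: engine code and training data produce the
# trained model; new data refreshes the cache; both feed inference.
deps = {
    "trained_model": {"engine_code", "training_data"},
    "cache": {"new_data"},
    "inference_results": {"trained_model", "cache"},
}
print(nodes_to_rebuild(deps, {"training_data"}))
# -> ['trained_model', 'inference_results']
```

A CI/CD trigger would call this with the set of changed S3 objects and rebuild only the returned nodes, which is what keeps the pipeline incremental.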
Scaling considerations
Storage
§ Cloud-based storage allows ingestion of larger datasets
Compute
§ Quicker training iterations, both locally and on the cloud
§ Faster inference pipeline
Cost
§ CapEx → OpEx
Workflows
§ Self-serve analytics optimized for agility along a maturity framework
“Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated ‘research’ and ‘engineering’ roles… [settings where] engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly.” [1]
[1] Hidden Technical Debt in Machine Learning Systems
https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
Accuracy considerations
§ More training data
§ Simulation / scene generation
– RoadRunner
Interop considerations
§ ONNX interop
§ Transfer learning techniques
§ Import/export to:
– OpenDRIVE®
– FBX®
– glTF™
– OpenFlight
– OpenSceneGraph
– OBJ and USD formats
§ Automated driving simulators and game engines (e.g. CARLA, VIRES VTD, NVIDIA DRIVE Sim®, Metamoto®, LGSVL, Baidu Apollo®, Unity®, and Unreal Engine®)
Embedded Targeting
§ GPU Coder™ Support Package for NVIDIA GPUs
§ NVIDIA® DRIVE™ and Jetson hardware platforms
Security Considerations
§ Cloud storage best practices
§ ACL-based permissions
§ IAM-based identity and access management
Conclusions
§ Well-architected systems [1] accelerate development of automated semantic segmentation against large datasets
§ Local development provides smooth workflows for developing and refining deep learning models
§ Cloud-based scaling of compute and storage can be leveraged on Databricks to enable self-service analytics
§ Techniques relevant to ADAS development are equally applicable in other domains such as medical imaging, geo-exploration, etc.
[1] AWS Well-Architected
https://aws.amazon.com/architecture/well-architected/
Key Takeaways
§ Build upon proven, safety-compliant, and DevOps-ready development tools
§ Leveraging best-in-class simulation integration platforms and modeling tools leads to faster time-to-market for ADAS development
§ Domain-specific tooling allows practitioners to write less and do more
§ If this is interesting, please contact us at [email protected] for more details or to get started.