
Cloud-LSVA

Large Scale Video Analysis

EUROPEAN COMMISSION

DG Communications Networks, Content & Technology

Horizon 2020 Research and Innovation Programme

Grant Agreement No 688099

D1.2 – Requirements, specifications and reference architecture

Project funded by the European Union’s Horizon 2020 Research and Innovation Programme (2014 – 2020)

Deliverable no. D1.2

Dissemination level Public

Work Package no. WP 1

Main author(s) Olesya Zeybel, Marcos Nieto, Joachim Kreikemeier

Co-author(s) Jos den Ouden, Phil Jordan, Branimir Malnar, Suzanne Little

Version Nr (F: final, D: draft) F

File Name D1.2 Requirements, specifications and reference architecture

Project Start Date and Duration 10 January 2018, 36 months

Document Control Sheet

Main author(s) or editor(s): Olesya Zeybel, Marcos Nieto, Joachim Kreikemeier
Work area: WP 1
Document title: D1.2 – Requirements, specifications and reference architecture

Version history:

Version | Date | Main author | Summary of changes
V0.1 | 2018/01/10 | Olesya Zeybel (Valeo) | Initial document for D1.2
V0.2 | 2018/01/16 | Jos den Ouden (TUE), Phil Jordan (IBM), Marcos Nieto (Vicomtech) | Contributions to several sections
V0.3 | 2018/01/24 | Jaonary Rabararisoa (CEA) | Pipeline engine
V0.4 | 2018/01/25 | Branimir Malnar (INTEL) | Automatic annotations on in-vehicle embedded platforms
V0.5 | 2018/01/29 | Suzanne Little (DCU) | Search engine
V1.0 | 2018/01/30 | Olesya Zeybel (Valeo), Marcos Nieto (Vicomtech) | Integration of contributions; review and document closing

Approval:

Role | Name | Date
Prepared | Olesya Zeybel | 2018/01/29
Reviewed | Marcos Nieto | 2018/01/30
Authorised | Oihana Otaegui | 2018/01/30

Circulation:

Recipient | Date of submission
EC |
Cloud-LSVA consortium |

Legal Disclaimer: The information in this document is provided "as is", and no guarantee or warranty is given that the information is fit for any particular purpose. The above-referenced consortium members shall have no liability for damages of any kind, including without limitation direct, special, indirect or consequential damages that may result from the use of these materials, subject to any liability which is mandatory due to applicable law. © 2018 by the Cloud-LSVA Consortium.


Abbreviations and Acronyms

Acronym | Definition
SW | Software
HW | Hardware
ADAS | Advanced Driver Assistance System
NCAP | New Car Assessment Programme
AEB | Autonomous Emergency Braking
HAD | Highly Automated Driving
IoT | Internet of Things
V2X | Vehicle to Everything
SLAM | Simultaneous Localisation and Mapping
ISO | International Organization for Standardization
CAN | Controller Area Network
RTK | Real-Time Kinematic
DGPS | Differential GPS
FOV | Field of View
SDK | Software Development Kit
TOSCA | Topology and Orchestration Specification for Cloud Applications
PaaS | Platform as a Service
NAS | Network Attached Storage
OS | Operating System
SQL | Structured Query Language
WDP | Warp speed Data Transfer
GUI | Graphical User Interface
VCD | Video Content Description
SCD | Scene Content Description
JSON | JavaScript Object Notation
ViPER | Visual Information Processing for Enhanced Retrieval
XML | Extensible Markup Language
INS | Inertial Sensors
IMU | Inertial Measurement Unit
ROS | Robot Operating System
OWL | Web Ontology Language
W3C | World Wide Web Consortium
RDF | Resource Description Framework


Table of Contents

Executive Summary
1. Introduction
   1.1 Purpose of Document
   1.2 Intended audience
   1.3 Related documents
2. Use cases description
   2.1 Annotation for ADAS-like application
      2.1.1 Phase 1: On Sidewalk / City Inter-urban
      2.1.2 Phase 2: On Roadway / City Inter-urban
      2.1.3 Phase 3: Realistic Combinations City / Highway
   2.2 Annotation for Cartography generation
      2.2.1 Phase 1: Navigation prototype with a closed loop for traffic signs
      2.2.2 Phase 2: Road markings and type classification
      2.2.3 Phase 3: Lane navigation prototype for HAD cars with closed loop for incremental map updates
3. Functional System Requirements for ADAS
   3.1 Car Configuration
      3.1.1 Requirements
   3.2 Measurements
      3.2.1 Measurement Package
      3.2.2 Measurement Frame
      3.2.3 Measurement Key-frame
      3.2.4 Measurement fragment: frame sequence
   3.3 Scenes and Scenarios
   3.4 Annotations
      3.4.1 Region-based Annotation
      3.4.2 Road Reconstruction
      3.4.3 Functional Annotation
      3.4.4 Annotation Data Aggregation from multiple measurements
      3.4.5 Levels of Annotation Automation
      3.4.6 Annotation Process & Quality Assurance
4. Technical System specifications
   4.1 General Architecture
      4.1.1 Infrastructure layer
      4.1.2 Platform layer
      4.1.3 Application layer
   4.2 Cloud Infrastructure
      4.2.1 Physical specifications
      4.2.2 Interface description
   4.3 Software Components
      4.3.1 Web front-end
      4.3.2 Annotation engine
      4.3.3 Dataset engine
      4.3.4 Search engine
      4.3.5 Analytics engine
      4.3.6 Upload engine
      4.3.7 Tools engine
      4.3.8 Pipeline engine
   4.4 Communication and Data Format
      4.4.1 Physical specifications
      4.4.2 Interface description
      4.4.3 Annotation format
   4.5 Scene Recording Module
      4.5.1 Recorder capabilities
      4.5.2 Data compression
      4.5.3 Physical specification for sensors
      4.5.4 File formats and chunking
      4.5.5 Automatic annotations on in-vehicle embedded platforms
   4.6 Middleware and SDK's
      4.6.1 RTMaps
      4.6.2 Computer vision and Machine learning SDKs
      4.6.3 Localisation Layers
5. References


List of Figures

Figure 1: Pedestrian on sidewalk.
Figure 2: Stopped-Slow-Braking.
Figure 3: Pedestrian on Roadway.
Figure 4: Overtaking & Narrow passage-detection free space.
Figure 5: Realistic combinations – Pedestrian to car.
Figure 6: Realistic combinations – Car to car.
Figure 7: Traffic sign localisation.
Figure 8: Lane markings and road classifications.
Figure 9: Optical flow.
Figure 10: Valeo test vehicle.
Figure 11: Subject vehicle datum reference frame.
Figure 12: Measurement package.
Figure 13: Logical view of measurements.
Figure 14: Frames / Time.
Figure 15: Asynchronous nature of measurements data.
Figure 16: Measurement key frame on pedestrian crossing.
Figure 17: Measurement fragments.
Figure 18: Images and Video annotation regions.
Figure 19: Road reconstruction.
Figure 20: Automatic Emergency Braking example.
Figure 21: Functional annotations for pedestrian crossing.
Figure 22: Functional annotations for lane change (ISO 17387:2008).
Figure 23: Pedestrian crossing annotations [1].
Figure 24: Annotations on different cameras (top: side-mirror camera, bottom: front camera).
Figure 25: Scenes vs fragments (top: side-mirror camera, bottom: front camera).
Figure 26: Degrees of annotation automation.
Figure 27: Semi-automated annotation process.
Figure 28: The cloud stack.
Figure 29: IBM Bluemix is a PaaS that works on top of the IBM SoftLayer IaaS.
Figure 30: Docker and Docker-compose will be used in the Cloud-LSVA development stage.
Figure 31: Kubernetes: (left) cluster example; and (right) Kubernetes architecture where containerised applications run inside pods, alongside related volumes. One or multiple pods run inside one node (VM).
Figure 32: Diagram of the reference architecture.
Figure 33: Cloud-LSVA Beta configuration.
Figure 34: Search Engine Framework.
Figure 35: Pseudo-UML diagram of the VCD structure: Managers are the main orchestration tool of Elements.
Figure 36: Object is a special case as it holds its content as an ObjectDataContainer that can contain heterogeneous ObjectData such as bbox, string, polygon, etc.
Figure 37: The SCD encapsulates all the information of a recording session, including calibration files and pointers to the different VCD files produced for the different sensors. Static content (e.g. list of recordings) reflects bibliographical information about the recording itself; dynamic content contains live information from the annotation process (e.g. which VCDs refer to the current SCD, or which is the association between labelled Elements across VCDs).
Figure 38: Objects in different views can correspond to the same object in the real world. The SCD allows these relationships to be identified and used to transfer annotations from one view to another, or to enhance queries against annotated content.
Figure 40: Data acquisition and upload process.
Figure 41: File chunking.
Figure 42: Example of SLAM features.
Figure 43: RoadDNA example.


List of Tables

Table 1: Main Cloud-LSVA modules.
Table 2: Additional engines and elements provided by existing technologies.
Table 3: Bare metal specifications and cost per node (Q1 2018).
Table 4: Annotation services.
Table 5: Data engine services.
Table 6: Search Engine Services.
Table 7: Analytics services for video annotation tools.


Executive Summary

The aim of the Cloud-LSVA project is to develop a software platform for efficient and collaborative semi-automatic labelling and exploitation of large-scale video data, solving existing needs of the ADAS and Digital Cartography industries.

Cloud-LSVA will use Big Data technologies to address the open problem of a lack of software tools, and hardware platforms, to annotate petabyte-scale video datasets, with a focus on the automotive industry. Annotations of road traffic objects, events and scenes are critical for training and testing the computer vision techniques that are at the heart of modern Advanced Driver Assistance Systems and navigation systems. Providing this capability will establish a sustainable basis to drive forward automotive Big Data technologies.

As part of the Cloud-LSVA project, WP1 establishes the basis and the reference documents and procedures to be followed during the subsequent RTD tasks. The definition activities are carried out following the iterative development plan outlined below:

• A detailed definition of user requirements collected from the end-user partners and advisors.
• A description of the use-cases and related test scenarios, which leads to the definition of the functional requirements of the system.
• A detailed definition of the legal, ethical, standardisation and economic requirements and restrictions to be considered.
• The design of the Cloud-LSVA reference architecture to establish a technical specification of the entire system.
• A description of the HW and SW components to be used during the project execution, using as a basis the reference architecture, the cost and legal restrictions, and the expected interfaces and deployment platforms.
• The establishment of an appropriate and feasible development lifecycle and a procedure to monitor compliance with the defined architecture and interfaces during the project lifespan. This is performed by the project's Architecture Review Board (ARB).

Additionally, as a critical milestone, the specifications must include the definition, creation or extension of existing ontologies for metadata description in the context of the selected scenarios.


1. Introduction

1.1 Purpose of Document

This document is a report containing a description of the specifications of the system requirements and a general view of the architecture of the Cloud-LSVA platform.

The main purpose of the document is to serve as a reference for basic design information about the Cloud-LSVA platform. Namely, its aims are:

• To provide a list of potential use-cases and test scenarios (section 2).
• To define the functional system requirements (section 3).
• To prescribe the car calibration-configuration (section 3.1) and the measurement fragments (section 3.2).
• To present the agreed understanding and requirements on data annotation (section 3.4).
• To design the Cloud-LSVA reference architecture and so establish a technical specification of the entire system (section 4.1).

The SW and HW components that compose the Cloud-LSVA platform are described throughout section 4.

This document is the final version of the series of deliverables on requirements, specifications and reference architecture (D1.1 and D1.2). A preliminary version was created in M3 (March 2016), containing a basic description of use cases and functionalities. Subsequent versions were created until M6 (June 2016) to reflect the agreed functionalities, SW and HW platforms, etc. During the first integration period (from M9 to M12), this document was used as a reference to actually start integrating components. After the Cloud-LSVA Alpha prototype had been built, during M13 the consortium created the consolidated version of D1.1. Finally, during the second integration period, the Cloud-LSVA Beta platform was built, integrating all developments carried out during the second year. After this integration, at month M25, a revision of the test report on the platform was used to create this final version of the requirements, specifications and reference architecture, to be used as guidelines during the last year of the project.

1.2 Intended audience

This deliverable D1.2 is addressed to the general public and serves as an overview of the proposed technologies that define the Cloud-LSVA platform.

1.3 Related documents

This document should be read together with the following documents, which complement its content or contain referenced material:

• Cloud-LSVA Technical Annex: the original description of the Cloud-LSVA platform, to be used as a reference and guide for the general objectives.
• D1.1 Requirements, specifications and reference architecture: first version of this deliverable, used as a guideline during the second cycle of the project.
• D5.4 Cloud-LSVA Beta prototype (report): a report of the technical activities carried out during the second iteration of the project (2nd year).
• D2.1 Specification of vehicles architecture and on-board software and hardware for recording and real-time tagging: extends the content provided here regarding the SW and HW components used for recording test scenes.
• D3.1 Import/export interfaces and Annotation data model and storage specification: contains detailed information about data formats.


2. Use cases description

The following two sections describe the two main application domains where Cloud-LSVA is applied: the Advanced Driver Assistance System (ADAS) use case and the Digital Cartography use case.

2.1 Annotation for ADAS-like application

One of the aims of Cloud-LSVA is to provide support for annotation tasks irrespective of a particular functional use case. This support can take the form of fully-to-partially automated workflows for off-board and on-board processing, online tool support for user-driven annotation, and methods and processes that make the tasks of annotation, machine learning, etc. manageable over very large datasets of measurements. A set of exemplary use-cases will be used to help construct the framework required to support the annotation effort.

In the automotive context of Advanced Driver Assistance Systems, two categories of use-cases are presented: the first category involves vulnerable road users, i.e. pedestrians, and a vehicle in diverse situations; the second category centres on multiple scenarios around different vehicles involved in road traffic situations. The objective is to have a good spread of examples and measurements involving both longitudinal and lateral control situations of the subject vehicle. All use-cases are spread across the three implementation phases of the Cloud-LSVA project to reflect an increasing level of complexity.

Please note that no assumptions are made about ADAS systems being tested during the data acquisition process; it is assumed that the acquisition of raw data is done under naturalistic driving conditions with no ADAS support.

For the definition of use cases, the following two concepts are defined: (i) Vulnerable Road User Category: the focus is primarily set on vulnerable road users such as pedestrians or cyclists; (ii) Vehicle Traffic Category: the use cases involve primarily vehicle situations.


2.1.1 Phase 1: On Sidewalk / City Inter-urban

In Phase I, the proposed use cases are borrowed from the AEB NCAP car-to-pedestrian scenarios, where the pedestrian is assumed to initiate road crossing from the sidewalk (near side), irrespective of occlusion state. In both situations, it is assumed that the driver of the vehicle is braking to avoid a collision.

Figure 1: Pedestrian on sidewalk.

The proposed use-cases focus primarily on braking situations where the target vehicle may be stopped, slow-moving or braking (similar to the NCAP car-to-car-rear scenarios). No assumption is made about the location of the vehicles (urban, inter-urban, etc.).

Figure 2: Stopped-Slow-Braking.

2.1.2 Phase 2: On Roadway / City Inter-urban

In Phase II, pedestrians/cyclists are assumed to be on the roadway: either walking along the road in the direction of travel or crossing from the far side. In the first case, collision avoidance can be achieved either by steering or by braking; a combination of both manoeuvres is also possible.

Figure 3: Pedestrian on Roadway.

The focus then moves towards obstacle avoidance in terms of a full or partial lane change. In the first case, situations where the subject vehicle needs to carry out an overtaking manoeuvre are considered. In the latter case, the subject vehicle may need to detect free space to avoid, for instance, construction work.

Figure 4: Overtaking & Narrow passage-detection free space.


2.1.3 Phase 3: Realistic Combinations City / Highway

In Phase III, the above-mentioned use cases can be merged into a single use-case. The objective here is to ensure that it is possible to annotate a complex situation.

Figure 5: Realistic combinations – Pedestrian to car.

As shown in Figure 6, complex multi-lane scenarios should be considered.

Figure 6: Realistic combinations – Car to car.

2.2 Annotation for Cartography generation

For TomTom, "lane-level" navigation is an incremental step from the road-navigation technology used in infotainment systems towards the cooperative-navigation technology needed for highly automated driving (HAD). It is essential to keep the map in sync with reality for lane-level navigation, and even more so for automated driving. Conventional map-making techniques no longer deliver the required level of freshness, so crowd-sourcing technologies are explored as a new way of map making. These implement a near real-time loop that updates the on-board map on the basis of deviations between this map and real-time information from the car's exo-sensors (i.e. camera, radar, LIDAR). The changes are committed to the back office to improve the next map. The process to produce map updates is automated with map-object classifiers that can run in the cloud environment for map production and in the vehicle system for lane positioning. Highly accurate annotation of pictures or videos is key to train the map-object classifiers and achieve the goal of an ever-improving map.


2.2.1 Phase 1: Navigation prototype with a closed loop for traffic signs

This use case demonstrates an upgrade of an off-the-shelf navigation device with crowd-sourcing software that detects 'speed-limit-sign' shapes in an area targeted for specific map updates. The detected traffic-sign shape is positioned and sent to the cloud system, which classifies the traffic sign in an automatic process. If sufficient observations accumulate to create evidence of a new or changed speed-limit sign, the update is distributed as a reference to all TomTom devices in the field. The evidence is stored in a historic database which can be used to train new algorithms. The navigation system will implement lane positioning for highway situations by tracking lane markings with the camera and matching the car position to the map. It implements an Android widget providing lane guidance advice, based on the lanes available in the static map. To validate the prototype results a highly accurate ground truth is required; this could be annotated data with precise positioning.

Figure 7: Traffic sign localisation
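The closed loop described above hinges on an evidence-accumulation step before an update is published. The following Python snippet is a minimal conceptual sketch of such a decision, assuming an invented observation structure and arbitrary thresholds; it does not represent TomTom's actual update logic.

```python
from collections import Counter

def speed_limit_update(observations, min_observations=5, min_agreement=0.8):
    """Decide whether crowd-sourced sign detections justify a map update.

    observations: list of (sign_value_kmh, position_id) tuples reported by
    devices in the targeted area. Both the thresholds and the structure are
    illustrative assumptions, not the project's specification.
    """
    if len(observations) < min_observations:
        return None                            # not enough evidence yet
    values = Counter(value for value, _ in observations)
    value, count = values.most_common(1)[0]
    if count / len(observations) >= min_agreement:
        return {"type": "speed_limit", "value_kmh": value}   # publish to devices
    return None

# e.g. six consistent observations of a 70 km/h sign trigger an update
obs = [(70, "p1")] * 5 + [(70, "p2")]
print(speed_limit_update(obs))
```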

2.2.2 Phase 2: Road markings and type classification

For this prototype, specific hardware is developed to run the object classifiers within the environment perception software; it is connected to the IoT (Internet of Things) infrastructure. It connects to the in-vehicle navigation application sub-system via wireless communication, and to the IoT infrastructure via V2X and cellular communication. The prototype will demonstrate improved lane navigation functionality based on fresh map data. This requires a new, extended map format with lane attributes that will be kept fresh on the basis of automated processing of crowd-sourced data. To validate the prototype results a highly accurate ground truth is required; this could be annotated data with precise positioning.


Figure 8: Lane markings and road classifications

2.2.3 Phase 3: Lane navigation prototype for HAD cars with closed loop for incremental map updates

In this prototype, more advanced lane positioning concepts will be integrated to support lane navigation, also in urban and rural environments, by fusing traditional positioning sensors with inputs from several visual sensors in the car and special map products for HAD cars such as RoadDNA or crowdsourced point clouds. An incremental map update is demonstrated based on the SLAM principle: road geometry observations from different cars in a targeted mining area are processed into an incremental map update by applying SLAM techniques.

Figure 9: Optical flow.

By using the above techniques, the algorithms from both Phase 1 and Phase 2 can be optimised. This will result in an even more precise positioning of traffic signs and of the car on the road and within its 3D environment. This prototype requires highly precisely positioned landmarks.


3. Functional System Requirements for ADAS

3.1 Car Configuration

The data acquisition process requires that all sensors are mounted and fixed to the subject vehicle. Additionally, data logging equipment is required to record the measurements to be annotated. To make sense of the recorded data, it is necessary to provide the annotation system with the actual position and orientation of the sensors.

Figure 10: Valeo test vehicle.

The same applies to recorded vehicle bus messages, which need to be converted back to scalar values. As multiple partners in the Cloud-LSVA project will make recordings, it is important that a common configuration format be defined to ensure smooth operation of the annotation process.

3.1.1 Requirements

Requirement: The Cloud-LSVA system shall provide a common, open car configuration format to help define the various sensor/device setups in the subject vehicle.
Rationale: The calibration file for the sensors (Velodyne and camera) will be provided for every recording.

Requirement: The Cloud-LSVA car configuration format shall enable unique identification of the subject vehicle.
Rationale: A subject vehicle used for data acquisition may be re-configured during its lifetime at the test centres. It is important, for processing reasons, to clearly identify the vehicle for which the configuration file has been produced. This identification may be a VIN or a proprietary serial number for the vehicle.


Requirement: The Cloud-LSVA car configuration format shall be used to describe the naming, position and orientation (frame of reference) of all sensors and reference systems on the subject vehicle with respect to a datum reference frame on the vehicle.
Rationale: A document describing the measurement and the location of the sensors will be provided along with the recorded file every time. The car coordinate system will be specified in this document.

Requirement: The Cloud-LSVA car configuration format shall define the datum as the intersection point of the rear axle with the vehicle centreline.
Rationale: The distance from the rear wheel axis to the Velodyne sensor is provided along the X and Z directions, as shown in Figure 11. Any changes in the mounting positions will be measured again and updated.

Requirement: The Cloud-LSVA datum axis system shall follow the orientation of the axis system as defined in ISO 8855.
Rationale: This axis system represents a right-handed orthogonal system of axes and determines the sense and orientation of the various vehicle motions, e.g. longitudinal (x), lateral (y) and vertical (z) translations, and roll (φ), pitch (θ) and yaw (ψ) rotations.

Figure 11: Subject vehicle datum reference frame.

Requirement: The Cloud-LSVA car configuration format shall provide the possibility of linking or embedding the intrinsic calibration of every sensor present in the subject vehicle.
Rationale: Intrinsic calibration data is needed to ensure proper interpretation / correction of the measurements.

Requirement: The Cloud-LSVA car configuration format shall provide the possibility of linking or embedding the databases necessary to decode vehicle messages recorded during data acquisition.
Rationale: The recorded data comes from the reference Velodyne sensor and the camera sensor, and the recording is done using the RTMaps software; the CAN or FlexRay data is not recorded.

Requirement: The Cloud-LSVA car configuration format shall enable the specification of the vehicle dimensions with respect to the datum reference frame.
Rationale: Any changes in the positions of the sensors will be updated.


Requirement: The Cloud-LSVA car configuration format shall enable the tagging of devices as either product sensors or reference sensors.
Rationale: A reference sensor is a device that is used by engineers to support the engineering design effort and will not be used in the final product, e.g. RTK DGPS or LIDAR range finders.

Requirement: The Cloud-LSVA car configuration format may enable the specification of the vehicle's physical properties, such as centre of gravity, inertial moments, etc., with respect to the datum reference frame.
Rationale: The specifications of the vehicle, together with the sensor mounting positions, are provided.

Requirement: The Cloud-LSVA car configuration and linked files shall be bundled into a container with a proper manifest, or follow an open packaging standard.
Rationale: This requirement ensures that no linked file is missed when the configuration is copied.
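The requirements above do not prescribe a concrete syntax for the car configuration format. Purely as an illustration of the kind of information it would carry (not the format adopted by the project), the following Python sketch describes a sensor setup relative to the ISO 8855 datum; all field and file names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SensorEntry:
    """One sensor/device mounted on the subject vehicle (illustrative only)."""
    name: str                    # e.g. "front_camera"
    kind: str                    # "product" or "reference" sensor
    position_m: tuple            # (x, y, z) w.r.t. the datum, ISO 8855 axes
    orientation_rad: tuple       # (roll, pitch, yaw) w.r.t. the datum
    intrinsics_file: Optional[str] = None   # linked intrinsic calibration

@dataclass
class CarConfiguration:
    """Hypothetical car configuration bundle content (not the project format)."""
    vehicle_id: str              # VIN or proprietary serial number
    datum: str = "rear-axle / vehicle centreline intersection (ISO 8855)"
    dimensions_m: Dict[str, float] = field(default_factory=dict)
    sensors: List[SensorEntry] = field(default_factory=list)
    message_databases: List[str] = field(default_factory=list)  # e.g. DBC/FIBEX

config = CarConfiguration(
    vehicle_id="VIN-XXXXXXXXXXXXX",
    dimensions_m={"length": 4.6, "width": 1.9, "height": 1.5},
    sensors=[
        SensorEntry("velodyne", "reference", (1.2, 0.0, 1.8), (0.0, 0.0, 0.0),
                    intrinsics_file="velodyne_calib.yaml"),
        SensorEntry("front_camera", "product", (1.8, 0.0, 1.3), (0.0, 0.05, 0.0),
                    intrinsics_file="front_camera_calib.yaml"),
    ],
)
```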

3.2 Measurements

It is assumed in this document that measurements will be obtained primarily from data logging equipment installed in the subject vehicle, which ensures that the measurement data records are correctly logged with the appropriate timestamp.

Figure 12: Measurement package.

This section focuses primarily on requirements needed for the annotation process ranging from the bundling of measurements down to the indexing and delimitation of the fragments of measurement to support annotation.


3.2.1 Measurement Package

Requirement: The Cloud-LSVA project shall define or select a container format (Measurement Package) to store all relevant data pertaining to a data collection run.
Rationale: A recording from the vehicle system may involve multiple files; the rationale of this requirement is to keep a recording packaged so that the measurements can be handled as a single logical package.

Requirement: The Cloud-LSVA Measurement Package shall provide a manifest listing all the content available in a given package. Any content not listed in the manifest shall be ignored and/or discarded.
Rationale: This requirement ensures that all files pertaining to the measurements are available; it provides a fast check that processing can be executed in a cost-effective way.

Requirement: The Cloud-LSVA Measurement Package shall enable fast listing of all relationships between content parts and/or content external to the package.
Rationale: Beyond a simple check of the content, it should be possible to clearly identify the categories of measurements available for processing: sensors, reference sensors, etc. An additional check can be done to ensure that data required and marked as external content can be accessed, for example DBC or FIBEX files required to decode bus messages; this kind of data is vehicle-specific and does not need to be maintained/copied with the measurements.

Requirement: The Cloud-LSVA Measurement Package shall provide a central content part listing all package properties, such as vehicle identification, date and time of data capture, test campaign identification, etc.
Rationale: Like an office document, it should be possible to access the "summary" properties quickly without having to process a lot of data.

Requirement: The Cloud-LSVA Measurement Package shall enable storage of the package in one or more physical files or blocks.
Rationale: As a recording can span multiple hours, it is best practice to split recordings into blocks. In a cloud context, the storage system may be an object store rather than a file system; to enable fast upload/download of data, blocks should have an optimal size to reduce latencies.

Requirement: The Cloud-LSVA Measurement Package shall provide a URL-based scheme to access content stored in a remote package; in this case no assumptions need to be made about the physical storage of the content.
Rationale: Even if the measurements will primarily be stored and accessible as files, the Cloud-LSVA platform will live primarily in a cloud environment, where access cannot be limited to conventional file systems.

Requirement: The Cloud-LSVA Measurement Package shall provide the ability to tag content stored in a remote package with pre-defined or ad-hoc categories representing the purpose for generating the measurement.
Rationale: Field testing can be done for different purposes: "open road" or statistical driving, track testing of specific scenarios, etc.

Requirement: The Cloud-LSVA Measurement Package shall be based as far as possible on open standards.
Rationale: A measurement package can potentially be archived.
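To make the manifest requirement concrete, the sketch below checks that every content part declared in a hypothetical manifest is present before processing starts; the file name 'manifest.json' and its fields are assumptions, not the container format the project will select.

```python
import json
import os

def check_measurement_package(package_dir: str) -> list:
    """Return content parts declared in the manifest but missing on disk.

    Assumes a hypothetical 'manifest.json' with a 'parts' list of relative
    paths and an 'external' list of content kept outside the package
    (e.g. vehicle-specific DBC/FIBEX files).
    """
    with open(os.path.join(package_dir, "manifest.json")) as f:
        manifest = json.load(f)

    missing = []
    for part in manifest.get("parts", []):
        if not os.path.exists(os.path.join(package_dir, part)):
            missing.append(part)
    # External content is only checked for reachability, not copied.
    for ext in manifest.get("external", []):
        if ext.startswith("file://") and not os.path.exists(ext[len("file://"):]):
            missing.append(ext)
    return missing
```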


Figure 13: Logical view of measurements.

3.2.2 Measurement Frame

In the context of a specific sensor measurement, a frame is uniquely defined by a combination of the sensor data and the point in time the data was acquired. A discrete sequence of frames makes a measurement. For some sensors, the timing and data of a frame are quite simple: for a camera, an image acquired at a given point in time. For other sensors this is more complex: a LIDAR range finder rotating over a vertical axis generates discrete scans (firings) of the environment at different time intervals. From a processing perspective, a full rotation of the device would be required; a frame in this case covers a time interval, although traditionally it would be assigned a single time point.


Figure 14: Frames / Time.

Requirement: The Cloud-LSVA project shall provide guidelines or standards to define the data and timing of frames for different types of sensors.
Rationale: This is to ensure that there is a common understanding of what constitutes a frame for all devices.

In terms of frame timing, no assumptions can be made that a measurement is assigned a fixed sample time; even in video streams for ADAS systems, the video frame rate can be variable. Additionally, no assumptions can be made that the timing of one measurement matches, frame by frame, the timing of other measurements. The frame timing provides a total order over the frames of a single measurement. Correct timing across measurements is assumed to be provided by the data acquisition system. To ensure quick navigation across the frames of a single measurement, a zero-based integer index is assigned to each frame.

Figure 15: Asynchronous nature of measurements data.

The basic properties of a measurement frame are its index, its timestamp and the sensor data.
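To make the frame notion concrete, the following sketch (an assumption for illustration, not a project interface) models a frame by exactly these three properties and shows how the asynchronous streams of Figure 15 could be aligned by nearest-timestamp lookup.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Frame:
    index: int        # zero-based index within a single measurement
    timestamp: float  # seconds; no fixed sample time is assumed
    data: Any         # sensor payload (image, LIDAR sweep, bus message, ...)

def nearest_frame(frames: List[Frame], t: float) -> Frame:
    """Return the frame of one measurement whose timestamp is closest to t.

    Frames of a single measurement are totally ordered by timestamp, so a
    binary search suffices; frames of different sensors are aligned only
    approximately, never frame-by-frame.
    """
    times = [f.timestamp for f in frames]
    i = bisect_left(times, t)
    if i == 0:
        return frames[0]
    if i == len(frames):
        return frames[-1]
    before, after = frames[i - 1], frames[i]
    return before if t - before.timestamp <= after.timestamp - t else after
```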


3.2.3 Measurement Key-frame

A measurement key-frame is a frame that is associated with some specific significant event. For example, a video key-frame in a pedestrian crossing situation may depict the instant when the pedestrian crosses a virtual boundary delimiting a roadside curb.

Figure 16: Measurement key frame on pedestrian crossing.

Any measurement frame can be tagged as a measurement key-frame by an annotator. The tag must follow clearly defined conventions.

3.2.4 Measurement fragment: frame sequence

A measurement fragment is a continuous sequence of frames delimited by start and end time points, i.e., delimited by two frames. A complete measurement is then the maximum extension a fragment can have within a measurement. Measurement fragments are subject to interval algebras such as Allen's interval algebra. Measurement fragments can be manually created by an annotator or generated by an automated annotation system. For instance, a fragment may represent a road-type segment, such as a motorway segment where the start and end frames are key-frames representing the entry and exit motorway road signs; a fragment could also represent an object track for a target vehicle or pedestrian.

Requirement: The Cloud-LSVA project shall define and implement a set of operators to manipulate fragments and provide access to the frames in the fragments (see the sketch after Figure 17).
Rationale: The main objective is to allow a set of standard operators with the same semantics to be used across the implementation of all Cloud-LSVA platform tools.


Figure 17: Measurement fragments.
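The operator requirement above can be pictured with a minimal interval type. The relations below cover only a small subset of Allen's interval algebra and are an illustrative sketch, not the operator set the project will standardise.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fragment:
    """A continuous frame sequence delimited by start/end frame indices (inclusive)."""
    start: int
    end: int

    def contains(self, other: "Fragment") -> bool:
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other: "Fragment") -> bool:
        return self.start <= other.end and other.start <= self.end

    def intersection(self, other: "Fragment") -> Optional["Fragment"]:
        if not self.overlaps(other):
            return None
        return Fragment(max(self.start, other.start), min(self.end, other.end))

# e.g. a motorway segment and an object track on the same measurement
motorway = Fragment(1200, 5400)
track = Fragment(5000, 6100)
assert motorway.overlaps(track)
assert motorway.intersection(track) == Fragment(5000, 5400)
```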

3.3 Scenes and Scenarios

In theatrical plays, a scene is understood as a setting where some action or event occurs, and is often linked to a specific place and actors; a scene has a beginning and an end. In an annotation context for ADAS development, a scene represents a situation of interest to engineers to help design, train or validate systems. For example, in the context of pedestrian detection, engineers will be interested in scenes involving pedestrians (pedestrians crossing the road, or pedestrians walking on the sidewalk), as well as scenes characterised by the absence of pedestrians, such as motorway sections with no pedestrians in sight.

In terms of scenario, a scene may include different variants in the motion/behaviour of the different actors. In a pedestrian ADAS context, a pedestrian may be crossing the road on a pedestrian crossing or crossing the road diagonally without obvious ground markings; alternatively, a pedestrian may halt near the curb and wait for the subject vehicle to pass before crossing.

In its simplest form, scenes are tags assigned to manually defined fragments. A fragment may be assigned multiple scene tags depending on the actor/scenario focus of the scene definition. In the Cloud-LSVA platform, one of the objectives is to provide a more advanced scene understanding concept to help data-mine existing measurements for fragments matching particular scene and/or scenario rules.

Requirement: The Cloud-LSVA project shall provide a clear definition of what constitutes a scene.
Rationale: Simple tagging is not sufficient if one tries to enable automatic recognition of scenes. An annotator or developer should be able to define the primary characteristics of the scenes he/she is looking for.


Requirement: The Cloud-LSVA project shall explore the feasibility of mining scenes from annotated measurements.
Rationale: Automatic recognition of scenes is not a simple problem to solve.

Requirement: The Cloud-LSVA project shall provide a methodology and supporting tools to clearly define the ontology required to annotate and to define scene components.
Rationale: The NCAP test scenarios are a good example: the initial focus was on the recognition of adults, and the scenarios are evolving towards the inclusion of child detection. An ontology service for Cloud-LSVA must provide the ability to evolve with the standards without requiring a costly re-adjustment of data in the cloud.
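As a toy illustration of scene tags assigned to fragments (not the scene-mining machinery the project ultimately aims at), the sketch below attaches tags to fragments and retrieves the fragments matching a simple query; identifiers and tag names are assumptions.

```python
from collections import defaultdict

# fragment key (recording id, start frame, end frame) -> set of scene tags
scene_tags = defaultdict(set)
scene_tags[("rec_0001", 1200, 1450)].update({"pedestrian_crossing", "urban"})
scene_tags[("rec_0001", 3000, 5400)].update({"motorway", "no_pedestrian"})

def fragments_with(*required_tags):
    """Return the fragments carrying all of the required scene tags."""
    return [frag for frag, tags in scene_tags.items() if set(required_tags) <= tags]

print(fragments_with("pedestrian_crossing"))   # [('rec_0001', 1200, 1450)]
```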

3.4 Annotations

This subsection contains the agreed definition of annotations. It presents a number of different types of annotations, which span the identified functional requirements, including multiple views, annotation automation, etc.

3.4.1 Region-based Annotation

A standard approach to the manual annotation of video sequences is to delineate a region or patch of pixels and assign a label to the region for each frame. A region may take the form of a polygon or closed polyline (Figure 18: a, b, c) or a rectangle (Figure 18: d). The regions may delineate several types of areas described with one or more labels. In Figure 18a, three different labels are used to characterise the state of the lens in terms of visibility: Soiled, Blurred or Clear; this type of annotation is required to help determine the camera state. In Figure 18b and 18d, each polyline/rectangle is labelled as a car object along with its identifier as text or colour. In Figure 18c, regions are defined at pixel level and the overlaid colours represent the class of object, e.g. blue for cars, pink for sidewalk, purple for the road.

Figure 18: Images and Video annotation regions.

Requirement: The Cloud-LSVA project shall provide capabilities to automatically/manually position/generate regions onto video frames. It shall be possible to manually modify point placements.
Rationale: Annotation of regions in images and video.
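Purely as an illustration of the region-based annotations described above (the annotation format actually used by the project is the VCD introduced in section 4.4.3), one frame-level region could be represented as follows; the field names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RegionAnnotation:
    frame_index: int
    label: str                          # e.g. "car", "sidewalk", "Soiled"
    object_id: int                      # identity of the annotated object
    shape: str                          # "rect", "polygon" or "polyline"
    points: List[Tuple[float, float]]   # pixel coordinates; 2 points for a rect

# rectangle around a car in frame 421 (illustrative values)
car_box = RegionAnnotation(frame_index=421, label="car", object_id=7,
                           shape="rect", points=[(312.0, 240.0), (398.0, 301.0)])
```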


3.4.2 Road Reconstruction

Road reconstruction is necessary to determine the qualitative position of objects, as well as to help determine which objects need monitoring for the ADAS functions (e.g. the car in front for Autonomous Emergency Braking).

Figure 19: Road reconstruction.

Requirement: The Cloud-LSVA project shall provide capabilities to automatically segment road structures from camera data and, possibly, automatically label the detected regions.
Rationale: Determination of the qualitative position of objects and of which objects need monitoring for the ADAS functions.

3.4.3 Functional Annotation

Functional annotations are annotations that go beyond the simple determination of pixel regions from detection and are dependent on the type of advanced driver assistance system function being developed. As a consequence, functional annotations are pegged to a particular type of scene and a particular function. As shown in Figure 20, functional annotations are based on observable measures extracted from the image/video material, such as the headway distance to the front vehicle or the relative velocity between two vehicles, from which dependent variables such as time-to-collision or time-to-brake indicators can be computed. Observable measures are usually time-dependent and will vary from frame to frame.

Figure 20: Automatic Emergency Braking example.


In some cases, as shown in Figure 21 and Figure 22, supporting geometry needs to be introduced. In Figure 21, a curb line (in red) is overlaid in world coordinates to mark the boundary between the sidewalk and the road lane, and serves as a transition boundary; the curb line is invariant in world terms across all video frames in the scene. In Figure 22, zones 1 and 2 as well as the warning line (yellow) are positioned relative to the ego vehicle; the areas of the zones, as well as the warning line position, change in proportion to the ego-vehicle velocity, hence frame by frame.

Figure 21: Functional annotations for pedestrian crossing.

Figure 22: Functional annotations for lane change (ISO 17387:2008).

Lateral, longitudinal and Euclidean distances from the ego vehicle to other vehicles and/or pedestrians can be measured and used to compute different types of indicators as in the AEB example (Figure 20).
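For the AEB example of Figure 20, a dependent variable such as time-to-collision follows directly from the observable measures. The snippet below is a simplified constant-velocity sketch, not the analytics implementation of the platform.

```python
def time_to_collision(headway_m: float, relative_speed_mps: float) -> float:
    """Constant-velocity time-to-collision in seconds.

    headway_m: distance from the ego vehicle to the lead vehicle.
    relative_speed_mps: closing speed (ego speed minus lead speed); only a
    positive closing speed yields a finite TTC.
    """
    if relative_speed_mps <= 0.0:
        return float("inf")    # not closing in on the lead vehicle
    return headway_m / relative_speed_mps

# ego at 20 m/s, lead at 12 m/s, 32 m headway -> TTC = 4 s
print(time_to_collision(32.0, 20.0 - 12.0))
```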


Figure 23: Pedestrian crossing annotations [1].

In addition to the geometric aspects, functional annotations also extend to individual object attributes that may be necessary to analyse specific situations. Figure 23 illustrates some of these aspects in the context of pedestrian protection. Important characteristics of pedestrians are: the actual class (child, adult or elderly); the action being performed by the pedestrian (walking, running, standing still); the direction of travel in case of movement; and head orientation. In the case of vehicles, odometry and direction of travel are also required. This additional information is then used either to compute dependent variables or to establish statistics about the behavioural patterns exhibited by the various actors.

3.4.4 Annotation Data Aggregation from Multiple Measurements

Traditionally, annotations are produced for a single sensor. As multiple cameras may be embedded in the test vehicle, it is important that the individual annotated regions (on different cameras) belonging to the same objects are aggregated together. Each camera has a different position and direction, and the field of view (FOV) will be different. Hence an object leaving the FOV of one camera may still be seen by another sensor (e.g. in Figure 24 the parked grey car is visible on the side-mirror camera, but not on the front camera).

Figure 24: Annotations on different cameras (top: side-mirror camera, bottom: front camera).


A scene can span different sensors; in a pedestrian crossing context (Figure 25), the pedestrian may appear first on the front camera and then become visible on the right mirror camera. The annotated fragments for the scene will register the poses of the pedestrian; all events and actions are expected to be in sync, but the fragments may have different durations and/or start/end time points.

Figure 25: Scenes vs fragments (top: side-mirror camera, bottom: front camera).

3.4.5 Levels of Annotation Automation

Currently, annotations are produced manually by large teams of workers. Although tools exist to support annotation, productivity gains are still too low to enable large-scale video annotation. As a result, only a relatively small amount of acquired data can be processed. This implies that the concept of annotation is tightly linked to the concept of ground truth. Ground truth consists of annotations that are considered to be “correct” and that are used to train machine learning algorithms or as a test reference. Ground truth is, by its very nature, bound to a quality assurance process. Current practice implies that manual workers produce the ground truth.

Figure 26: Degrees of annotation automation.

The ideal situation would be to automate the annotation generation, which itself relies on machine learning, and thus introduces false or missed detections into the annotations. The major advantage of automation is to ensure that most, if not all, of the acquired measurements are processed and can then be indexed by annotation content, even if partially incorrect. But with the introduction of annotation automation there must be a decoupling between annotation and ground truth, whereby a path is left open to “transform” annotations into ground truth. Furthermore, annotation automation is limited by the detection functions made available to the platform. If new object classes have been identified for annotation, a team of manual workers is then required to perform partial object annotation. Within this context, semi-automation of annotations (besides full automation) is also a major requirement to support different activities such as:

Correcting automatically generated annotations,

Annotating new object classes.


3.4.6 Annotation Process & Quality Assurance

To ensure that annotations are of high precision, a quality assurance process must be put in place to ensure compliance with a set of standards concerning the annotations. Whether the process is conducted in a manual or an automated manner, in both cases a review process must be put in place to validate ground truth. The particularity of automation is the added requirement that specific key performance indicators must be made available to ensure proper monitoring and assessment of annotators with regard to validating automated annotations; these kinds of indicators are called “laziness” indicators and refer to the anticipated behaviour of an annotator when reviewing / correcting annotations coming from a detection function.

Figure 27: Semi-automated annotation process.


4. Technical System specifications

This section contains the core contribution of this deliverable, including a description of the reference architecture of Cloud-LSVA, a list of SW components, the defined data formats, the scene recording module, and the identified 3rd-party software and SDKs to be used during the development stage.

4.1 General Architecture

One of the major aims of the Cloud-LSVA platform is to provide services and applications to end users for the annotation process. The software platform should run in a cloud environment. To ensure that each partner can develop, deploy and run the application, a cloud-agnostic environment has been selected.

Figure 28: The cloud stack.

Figure 28 shows the system stack devised for Cloud-LSVA. At the top, the Cloud-LSVA

system will be built as a Software or Application layer, where all functionality resides, and

where all modules intercommunicate using standard communication channels (e.g. RESTful

web services). The Platform layer below is composed of a set of engines representing technologies that can take care of scaling, optimising and deploying the Cloud-LSVA functions

while automatically managing the underlying resources. At the bottom, these resources

(computation, networking, storage), are represented by the Infrastructure Layer.


4.1.1 Infrastructure layer

In terms of infrastructure, and for development and testing purposes, each partner should be

able to decide how to implement the environment. For the integrated Cloud-LSVA system,

IBM will provide the appropriate infrastructure (i.e. IBM’s SoftLayer) described in a later

section (section 4.2).

4.1.2 Platform layer

In terms of platform, the principal objective is to define a set of basic platform functionalities to handle core services, such as a web application server, analytics tool deployment, and orchestration. Different technologies might be applicable for each functionality. The following paragraphs summarise the considered technologies that might be applicable at Platform level:

IBM Bluemix

IBM Bluemix is a Platform as a service (PaaS) cloud, developed by IBM and based on Cloud

Foundry. It supports many programming languages and services, and also DevOps to build,

run, deploy and manage applications on the cloud. It runs on IBM’s SoftLayer infrastructure.

Figure 29: IBM Bluemix is a PaaS that works on the top of IBM SoftLayer IaaS

Considering that the IBM SoftLayer IaaS has been selected for the integration of the Cloud-LSVA prototypes (as described in deliverable D2.1 and section 4.2), the IBM Bluemix PaaS must be considered as a candidate PaaS for the later development stages of the Cloud-LSVA project, when SW automation and scaling might be considered as features to be tested. Its utilisation is also subject to budget resources and licensing agreements.

Docker

Docker1 is an open-source technology that automates the deployment of applications inside

software containers. Different modules of the Cloud-LSVA application, including computer

vision or machine learning, can be developed by partners and then encapsulated in the so-

called Docker images, which can be shipped and stored in stores, such as the Docker Hub2

(local Hubs can be created as well). Then, Docker images can be taken and instances of

them can be executed, in the form of the so-called Docker containers. The execution of the

containers is managed by the Docker Engine.

Figure 30: Docker and Docker-compose will be used in Cloud-LSVA development stage.

Docker can be used to package computationally heavy, CUDA-enabled algorithms that use computer vision or machine learning libraries.

Docker-compose is a tool for defining and running multi-container Docker applications by means of a configuration file that defines the containers to launch, the volumes to mount, and the commands to manage the whole lifecycle of the application.
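As a sketch of how a containerised analytics tool could be executed programmatically during development, the following snippet uses the Docker SDK for Python to run a container from an image and collect its output; the image name, command and volume paths are hypothetical placeholders, not actual Cloud-LSVA artefacts.

    import docker  # Docker SDK for Python (pip install docker)

    client = docker.from_env()  # connect to the local Docker Engine

    # Run a hypothetical analytics image stored in a local registry.
    logs = client.containers.run(
        "cloud-lsva/lane-detector:latest",
        command="--input /data/clip.avi --output /data/clip.vcd.json",
        volumes={"/mnt/nas/clips": {"bind": "/data", "mode": "rw"}},
        remove=True,   # clean up the container once it exits
    )
    print(logs.decode("utf-8"))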

Kubernetes

Kubernetes is an open-source orchestration framework for automating deployment, scaling,

and management of containerised applications. Kubernetes is designed to work in multiple

environments, including bare metal, on-premises VMs, and public clouds. Kubernetes only

needs the applications to be containerised in a supported format, such as Docker images.

Kubernetes has a number of key features. It automatically places containers based on their

resource requirements and other constraints, while not sacrificing availability. It also restarts

containers that fail, replaces and reschedules containers when nodes die, and kills

containers that don't respond to user-defined health checks. Scaling and upgrading

applications is simple and nearly transparent to the user, as Kubernetes handles new

resources and connectivity among replicas, while making sure the service is up and running

1 https://www.docker.com/ 2 https://hub.docker.com/


during the process. Also, Kubernetes takes care of properly balancing load over groups of related containers.

Given all these features, Kubernetes is being widely used in cloud deployments.

Configuration is easily customisable through YAML files, which control the basic layout of the

containerised applications and their relations in terms of Kubernetes items (pods, nodes,

volumes, services, etc.).

Figure 31: Kubernetes: (left) cluster example; and (right) Kubernetes architecture where containerised applications run inside pods, alongside with related volumes. One of multiple pods run inside one node (VM).
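For illustration, a containerised Cloud-LSVA module could be deployed and scaled through the Kubernetes Python client as sketched below; the image name, labels and replica count are hypothetical, and in practice the same definition would normally live in a YAML manifest as described above.

    from kubernetes import client, config

    config.load_kube_config()  # use the local kubeconfig credentials

    # Hypothetical deployment of a containerised annotation worker.
    container = client.V1Container(name="annotation-worker",
                                   image="cloud-lsva/annotation-worker:latest")
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "annotation-worker"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=3,  # Kubernetes keeps three replicas running and reschedules failures
        selector=client.V1LabelSelector(match_labels={"app": "annotation-worker"}),
        template=template,
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="annotation-worker"),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_deployment(namespace="default",
                                                    body=deployment)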

4.1.3 Application layer

Finally, a reference architecture for the Software level is defined based on this platform, with the core modules listed in Table 1 and the additional engines provided by existing technologies listed in Table 2. Figure 32 shows an illustrative diagram of the identified Cloud-LSVA engines (also referred to as “Modules” or “Managers”), with a separation between the back-end and front-end layers, the underlying infrastructure resources (compute, data, stores, etc.), and the core services exposed across the platform. This conceptual view ignores the type of technology to use and the specific Platform and Infrastructure used.


Figure 32: Diagram of the reference architecture.

In general terms, the Cloud-LSVA system is a cloud-based system that exposes a number of

functionalities related to the annotation of large volumes of data coming from sensorised

vehicles.

In basic terms, there are four main elements around the Cloud-LSVA system:

Data: in the form of video/sensor information recorded from equipped vehicles

(RTMaps works as the main mechanism to manage data at the vehicle (for

recording) and at the cloud (to retrieve data streams); while VCD is the defined

language for annotations and metadata).

Front-end: the (web) interface of the system to the human users of the platform,

which exposes services and functionalities to perform actions (e.g. annotation videos,

training models, etc.).

Back-end: the core SW engines that provide the underlying functionality of the

system (e.g. learning, deploying algorithms, storing data, formatting annotations,

etc.).

Cloud resources: the infrastructure that enables the functionalities, including

storage resources (NAS system), annotation databases (e.g. MongoDB), tools store

(i.e. Docker Registry), and computing resources (e.g. GPU-enabled servers).

There are two main users of Cloud-LSVA:

Annotators: operators that access Cloud-LSVA to perform annotation tasks through a GUI, such as identifying objects in images, labelling time intervals with recognised actions, etc.

Engineers: trained personnel, expert in ADAS systems and computer vision and/or deep learning technologies, who use Cloud-LSVA to manage datasets and to create annotation tasks. Engineers also have uploading functionalities: the content collected from sensorised vehicles must be uploaded to the cloud-side storage for its analysis, and must be monitored and controlled by the platform.

By design, Cloud-LSVA will offer a common GUI through a web application, which will work as the front-end of the system. The implementation of this Web App Engine can be tackled using a variety of technologies (Angular, Polymer, etc.). The front-end provides access to the different functionalities offered by the back-end.

The back-end of Cloud-LSVA is basically composed of the SW engines that provide the

functionality of the system, which relies on the HW systems where the Cloud-LSVA platform

is deployed (including the storage of raw content from sensors, and the computing clusters

where the SW is executed). The SW part is, therefore, composed of a number of modules, in

the form of web applications (e.g. Web Application Archives) which define functions and

REST interfaces for interoperability (details are provided in tables below and in section 4.3).

Cloud-LSVA can then be used:

to create spatio-temporal annotations on multiple synchronised video streams;

to launch automatic annotation processes on subsets of video footage;

to load existing annotations and perform operations on them (verify, correct, detail, etc.);

to upload new content into the storage repositories;

to evaluate the performance of a given algorithm against an annotated dataset.

Note: an extended and detailed list of annotation use cases is provided in deliverable “D3.4 Video Annotation Tools”; updated use cases are continuously being created by the consortium during the development tasks and will be reported in the corresponding prototype reports (D5.4 and D5.5).

To do so, the Cloud-LSVA application layer requires the implementation of the following

modules and preliminary list of exposed services:

Table 1: Main Cloud-LSVA modules.

Analytics Engine: Creates recipes to launch detectors on datasets; returns annotations; communicates with the Dataset Engine to receive data/metadata. Tooling*: Viulib/OpenCV, Caffe/DIGITS/TensorFlow.

Annotation Engine: Communicates with the Dataset Engine to access metadata; creates/updates/merges annotations from different sources (e.g. automatically or manually generated); compares annotations to create evaluation reports. Tooling*: Viulib VCD, Viulib Evaluator.

Dataset Engine: Manages and browses the measurement datasets as well as fragments; datasets include measurements, scene sets, training sets, … Tooling*: RTMaps, MongoDB, PostgreSQL.

Search Engine: Executes short- as well as long-duration queries against the Cloud-LSVA system. Tooling*: MongoDB, Elastic Search.

Upload Engine: Manages the upload and transformation of large datasets. Tooling*: RTMaps, SoftLayer Data Transfer Service.

Web App Engine: Provides the GUI to users (annotation interface, training/engineer interface, app management interface); handles authentication and security tokens. Tooling*: Angular, HTML5.

* Tooling: this is the list of selected technologies for the development of the identified module at the Cloud-LSVA Beta prototype. However, this document only presents a reference architecture, i.e. a definition of the functionalities and the expected interfaces. Therefore, changes may apply for the integration of the Gamma prototype (to be reported at deliverable D5.5).

Additionally, some of the engines and elements identified in Figure 32 directly relate to existing technologies available to the consortium (i.e. these modules need not be implemented; existing technologies already provide the required functionalities).


Table 2: Additional engines and elements provided by existing technologies.

Tools Engine: Manages and deploys analytics tools on the compute cluster; manages the Tools Store. Tooling: Docker.

Tools Store: Repository of tools available for execution. Tooling: Docker Registry.

Pipeline Engine: Orchestrates the execution of tools; manages and monitors the compute cluster elastically. Tooling: Docker-compose, Kubernetes.

4.2 Cloud Infrastructure

The infrastructure of the Cloud-LSVA platform is intended to facilitate the capture, persistence, storage and simulation of very large video datasets in the cloud. Data can be in the form of compressed or uncompressed multiple HD video streams and will be accompanied by annotations (after processing). The Cloud-LSVA platform is hosted on IBM Cloud (formerly known as IBM SoftLayer), which provides infrastructure as a service via the Amsterdam datacentre.

The challenge is to provision a sufficient amount of storage and computing power in a timely fashion to enable adequate storing and processing of the data, while keeping within the budgetary constraints of the grant agreement. For the first phase, the requirement was to work with a video dataset of ~20TB in size; after each phase of the project, additional data will be added to help evolve the platform. Cloud-LSVA began with a 22TB subset of the 1.2 petabytes of video available to the project and is due to grow in this final year, up to but not exceeding 140TB during the Gamma phase. This data is comprised of a variety of vehicle sensor data, most of which has been supplied by Valeo and has been augmented with cartography and annotation metadata.

For the Alpha phase of the project there were many unknown unknowns as to what requirements would be needed to build such a platform in the cloud; as such, an extremely flexible architecture was implemented upon bare metal servers and abstracted with various levels of virtualisation. This approach negated the potential security pitfalls of a multi-tenancy scenario, while also centralising the effort for securing the platform, which is a key issue considering that elements of video footage may contain personally identifiable information. It is important to note that annotation cannot work on obfuscated images as it would render the detections useless.

During the Beta phase of the project, a GPU was added to the platform and it quickly became clear that the required functionality of this device was not available while using a virtualised platform; as such, a decision was made to migrate the existing platform away from a hypervisor at the beginning of the Gamma phase. At the time of writing this report, the migration process was still underway, with the hypervisor still present in the platform, and as such it is detailed below in the physical specifications of the bare metal servers.

4.2.1 Physical specifications

As the project enters the Gamma phase, many of the questions as to which algorithms are going to be used for annotation, what file types are going to be used for the video files, and how video files will be processed and transferred (compressed or uncompressed) have been clarified after the successful workshops run during the Beta phase. As such, the system requirements have been built and shaped based upon the findings of the work completed in both the Alpha and Beta phases.

For the Alpha phase, no GPU’s were required and all heavy processing was done using

CPU power, however as the Beta phase progressed, GPU’s became a necessity and

multiple options where considered. Previous iterations of this documented listed NVIDIA

Tesla K80 Graphic Cards as the intended GPU for the platform, however as technology

progressed, new cards were made available via the cloud platform provider, and these were

taken into account, namely the NVIDIA Tesla M60 Graphic Card and the NVIDIA Tesla P100

Graphic Card. At the time of provisioning, the NIVIDIA P100 card was the most powerful,

least complex to implement and the most economic choice (performance per € spend) for

the platform. It is also beneficial to note that the GPU choice matches that of the on premise

GPU’s used by Vicomtech, thus lessening the potential delays with development

compatibility and expected results. The resulting architecture as shown below in Figure 33

depicts the storage and processing nodes are separate servers entirely. This approach

allows for the horizontal and vertical scaling of the platform, with minimal impact on

interdependency between processing nodes. As more processing node are required, storage

remains static and is shared between all nodes in the platform, thus reducing the overall cost

of scaling the platform.


Figure 33: Cloud-LSVA Beta configuration.

Table 3 below lists the monthly costs for the existing virtualised server (CLSVA-BM1) and the GPU-ready server (CLSVA-GPU1), along with a summarised list of hardware specifications. The table also lists the potential cost of adding a second GPU card to the existing machine, as the Gamma phase is intended to scale up to multiple GPUs before the end of the project.

CLSVA-BM1: Dual 2.6GHz Intel Xeon-Haswell (E5-2690-V3, dodeca-core); 2TB SATA configured as 1TB RAID1; 128 GB RAM; no GPU; monthly cost per node: $2,323.06.

CLSVA-GPU1: Dual 2.4GHz Intel Xeon-Haswell (E5-2620-V3, hexa-core); 2TB SATA configured as 1TB RAID1; 64 GB RAM; one NVIDIA Tesla P100 GPU; monthly cost per node: $1,473.43.

CLSVA-GPU2: 2.4GHz Intel Xeon-Haswell (E5-2620-V3, hexa-core); 2TB SATA configured as 1TB RAID1; 64 GB RAM; two NVIDIA Tesla P100 GPUs (primary and secondary); monthly cost per node: $2,097.23.

Table 3: Bare metal specifications and cost per node (Q1 2018).

The storage platform for this environment is a bare metal server that has been provisioned with 80TB of disk space over a RAID5 array using OSNEXUS’s QuantaStor solution. The RAID5 setup allows for fast read times (annotation will be read-heavy), and also has the bonus that, if a disk fails, the parity checksum on the remaining drives is sufficient to recalculate the data onto another drive. The bare metal servers are fitted with 3Ware 9550SX RAID controllers for SATA drives, which can be configured for maximum performance. The 80TB of disk space would allow a bare metal hypervisor to access the 220TB video archive to be stored, and would allow for multiple sandboxes into which various scenes or scenarios can be copied for annotation.

Data can be uploaded via public or private uplinks with speeds from 100Mbps up to 10Gbps (currently all servers have 1Gbps); yet with the large data volumes encountered on the Cloud-LSVA project, uploads over the internet are not the recommended course of action. Instead, it is quicker and cheaper to send a compatible device containing the required data to the datacentre to be connected directly to the network there, enabling direct data transfer (this data transfer service is offered free of charge to all IBM Cloud customers). Bear in mind that, although it is possible to upgrade the internet links of the Cloud-LSVA servers to enable much faster uploads to the cloud, the sender’s uplinks must also have sufficient bandwidth to facilitate this upload for it to be of any advantage. Should uploads over the internet nevertheless be used, public internet traffic to the cloud provider is metered, with the current package with the cloud provider (IBM Cloud) allowing 500GB of traffic before metering metrics are applied. For the entirety of the project, the Cloud-LSVA cloud platform has been based in IBM Cloud’s datacentre in Amsterdam, where 20TB bandwidth packages can be purchased for $999, or the existing package can be left as is, with excess bandwidth usage charged at $0.09/GB USD. Considering the Alpha data upload of 22TB, any attempt to upload vast amounts of data over the internet would therefore have to be carefully balanced against these costs, as illustrated below.
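As a back-of-envelope illustration of that balance, the sketch below applies the figures quoted above (500 GB of included traffic, $0.09/GB overage, and the optional 20 TB package at $999) to a hypothetical 22 TB internet upload; whether the included 500 GB still applies on top of the 20 TB package is an assumption.

    # Illustrative cost estimate for uploading ~22 TB over the public internet.
    upload_gb = 22 * 1000                 # approximate, using decimal TB

    included_gb = 500                     # traffic included in the current package
    overage_per_gb = 0.09                 # USD per GB beyond the included traffic
    package_20tb_cost = 999.0             # optional 20 TB bandwidth package, USD

    # Option 1: keep the existing package and pay overage only.
    cost_overage_only = max(0, upload_gb - included_gb) * overage_per_gb

    # Option 2: buy the 20 TB package and pay overage on the remainder.
    cost_with_package = package_20tb_cost + max(
        0, upload_gb - 20 * 1000 - included_gb) * overage_per_gb

    print(f"Overage only:    ${cost_overage_only:,.2f}")   # about $1,935
    print(f"With 20TB pack:  ${cost_with_package:,.2f}")   # about $1,134

Either way, the cost is of the same order as a month of server rental, which illustrates why the physical data transfer service remains the preferred route.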

4.2.2 Interface description

The administrators of the IBM Cloud account (Vicomtech & IBM Ireland) have a login to the IBM Cloud Customer Portal, which grants tabbed access to all of the established devices, storage, network, security and services, and also gives access to the support areas. This portal lists all of the initial hardware and software configurations of the provisioned servers, while also supplying the connection information in the form of IP addresses, LAN configuration and passwords. The portal also supplies the information and the ability to connect to the platform to perform remote management during maintenance windows by using the Intelligent Platform Management Interface (IPMI).

There is also an Application Programming Interface (API) for SoftLayer, which is the development interface that gives developers and system administrators direct interaction with the backend system. The functionality exposed by the API allows users to perform remote server management and monitoring, and to retrieve information from IBM Cloud’s various systems. Generally speaking, any commands that can be run from IBM Cloud’s Customer Portal GUI can also be run via the API. For general access, the consortium has been granted access via a predefined set of approved IP addresses (each consortium member’s own private IP address range, supplied by their network team) and can connect to the platform via secure channels such as VPN and SSH.

4.3 Software Components

This section details the functionality and services provided by each of the SW components of

the Cloud-LSVA system, as identified in section 4.1.3.


4.3.1 Web front-end

This is the module that creates a web-based Graphic User Interface (GUI) from which users

can interact with the Cloud-LSVA platform.

As previously defined, different types of users are expected: (i) annotators, (ii) engineers/scientists, and (iii) system managers. Therefore, different types of interfaces will be presented according to the user credentials, to enable role-specific functions.

A great level of detail about this module can be found in deliverable “D3.2 Initial User Interface SW for Automatic and Collaborative Video Annotation”, which gathers the developments reached with respect to the GUI at month 10 of the project. Additionally, a report on the integration of the Web App Engine and GUI can be found in section 4.2.2 of deliverable “D5.4 Report on Cloud-LSVA prototype Beta”.

4.3.2 Annotation engine

The Annotation engine exposes services for reading, creating, updating and managing

annotations, in the form of Video Content Description (VCD) and Scene Content Description

(SCD) files or messages (see section 4.4.3).

For some services, the Annotation engine internally calls the Search engine to locate and

retrieve information from the annotation databases (e.g. getVCD).

In its simplest form, the annotation engine is only an interface to access the annotations in

the databases, but potentially, more functionality could be added to this module, such as

annotation merging, rating, automatic updating, etc.


Table 4: Annotation services.

getSCDList (GET; content: empty): retrieves the registered list of SCDs available in the raw data dataset.

addSCD (POST; content: SCD file): adds an SCD file to the list of available raw data datasets.

getSCD (GET; content: SCD id): retrieves the SCD file corresponding to the given id.

getVCDList (GET; content: SCD id): retrieves the VCD file names corresponding to a certain SCD.

getVCD (GET; content: SCD id, VCD id): retrieves the VCD identified by the given VCD id inside the SCD.

updateVCD (PUT; content: VCD, VCD id): takes a given VCD content as input and integrates it into an existing VCD file.

deleteVCD (GET; content: VCD id): deletes a given VCD.
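As an illustration of how these services could be consumed, the following Python sketch calls getVCD and updateVCD over HTTP; the base URL, path layout and argument names are assumptions, as only the resource names, methods and content come from the table above.

    import requests

    # Hypothetical base URL of the Annotation engine.
    BASE = "http://annotation-engine.cloud-lsva.local/api"

    # Retrieve the VCD annotation identified by (SCD id, VCD id).
    resp = requests.get(f"{BASE}/getVCD",
                        params={"scd_id": "scd_0042", "vcd_id": "vcd_0001"})
    resp.raise_for_status()
    vcd = resp.json()

    # ... modify the annotation locally (e.g. correct a bounding box) ...

    # Integrate the updated content back into the existing VCD file.
    resp = requests.put(f"{BASE}/updateVCD",
                        params={"vcd_id": "vcd_0001"}, json=vcd)
    resp.raise_for_status()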

4.3.3 Dataset engine

The Dataset engine is the module that interfaces with the raw data datasets, i.e. the

recordings. In particular, video, Lidar and other sensor information is encapsulated into

RTMaps files (see section 4.5). Therefore, the Dataset engine will be implemented linked to

the RTMaps SDK, in order to access the data inside the RTMaps files and provide the

required services.

It is expected that standard one-minute video (or sensor) clips are used as the atomic unit of

information inside the Cloud-LSVA system. Related services will be provided to extract such

one-minute clips from the larger RTMaps files. Those files will be temporarily stored and a

database of such existing files will be maintained by the Dataset engine.

The Dataset engine is also responsible for creating and managing training sets (in general, collections of images), created from annotations and usable by the Analytics engine.


Table 5: Dataset engine services.

createVideoClip (GET; content: SCD file, time frame): receives a request to create/extract a video clip from the source raw data specified in the SCD. The output is the path to the location of the generated video file.

deleteVideoClip (GET; content: video clip name): deletes an existing temporary video clip file used for annotation.

getVideoClipList (GET; content: empty): returns the names of the temporary video clip files existing in the system (i.e. the list of video clips under annotation).

4.3.4 Search engine

The Search engine is the module that exposes services to find specific content inside the annotation

and raw data databases.

The annotation task starts with the use of the Search engine, in order to retrieve video content to

annotate. Queries can be administrative, i.e. related to metadata of the video, e.g. date, geolocation,

or type of sensor; or semantic, i.e. related to the content of the annotations, such as the presence of

specific type of objects in the image (e.g. “car”, “pedestrian”), or actions (e.g. “overtaking”).

Figure 34: Search Engine Framework

The beta version of the search engine queries and retrieves video based on location and date. A

simple user interface has been implemented for testing. This will be further described in Deliverable

3.3.


Table 6: Search engine services.

search (GET; arguments: q_location, q_minDate, q_maxDate): queries MongoDB to return video_id values based on the location and date range.

preview (GET; arguments: video_id): previews the specified video.
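For example, a client could query the beta Search engine as follows; the host name, location value and response structure are illustrative, while the argument names q_location, q_minDate and q_maxDate are those listed in Table 6.

    import requests

    # Hypothetical Search engine endpoint.
    SEARCH_URL = "http://search-engine.cloud-lsva.local/search"

    params = {
        "q_location": "Amsterdam",   # illustrative location value
        "q_minDate": "2017-06-01",
        "q_maxDate": "2017-06-30",
    }
    resp = requests.get(SEARCH_URL, params=params)
    resp.raise_for_status()

    # Assumed response: a list of matching video identifiers.
    for video_id in resp.json():
        print("Matching clip:", video_id)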

4.3.5 Analytics engine

The Analytics engine is a web application that exposes services related to the analysis of the

content of the images. In particular, two main groups of analytics are considered:

Batch analytics tools: analytics related to the process of annotation of a given video:

e.g. detecting objects, segmentation, etc. These services analyse the entire video and

produce a single VCD output.

On-demand analytics tools: analytics related to the interaction between the user (via

the Web UI) and the back-end, to launch specific annotation tasks on a specific image or

frame of video (e.g. visual tracking of given bounding box at certain frame of a video).

The Analytics engine, upon request of one of the provided services, identifies which

algorithms to execute, and gets in contact with the Tools engine to launch them, and receive

the result back.

The following lists the set of services that the Analytics engine exposes. The details about the arguments and responses are yet to be defined:

Table 7: Analytics services for video annotation tools.

trackObject (GET; content: JSON message containing a bounding box, a frame number, and a video identifier): launches visual tracking algorithms to track the given object in the sequence. The result is a VCD annotation with all the bounding boxes of the object along the sequence.

detectObject (GET; content: JSON message containing a video identifier and additional time intervals, plus the name of the object class to detect, e.g. “Car”, “Pedestrian”, “Lanes”): launches a detector of the specified objects in the given video, using detection-by-classification tools or equivalent. The detector is selected from those available, using the models in the models database.
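A request to the trackObject service could then look as sketched below; the endpoint URL and JSON field names are assumptions, since the table only fixes the kind of content (a bounding box, a frame number and a video identifier).

    import requests

    # Hypothetical Analytics engine endpoint.
    ANALYTICS_URL = "http://analytics-engine.cloud-lsva.local/trackObject"

    payload = {
        "video_id": "clip_000123",
        "frame": 250,
        "bbox": {"x": 412, "y": 230, "width": 96, "height": 64},  # pixels
    }
    # The table specifies GET with a JSON message; the requests library accepts
    # a JSON body on GET, although a POST would be the more conventional choice.
    resp = requests.get(ANALYTICS_URL, json=payload)
    resp.raise_for_status()

    vcd_annotation = resp.json()  # VCD with the tracked bounding boxes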


4.3.6 Upload engine

There are two main mechanisms to upload content into the Cloud-LSVA storages, one per

application domain:

ADAS: bulk direct transfer of content from a NAS system into the storage device at infrastructure level (typically connecting, via USB 3.0, the NAS system used to record data in the vehicle to the NAS system of the Cloud-LSVA infrastructure). This type of transfer is required to upload massive amounts of data from sensorised vehicles (e.g. 20 TB from a 1-week recording session with an equipped vehicle).

Digital cartography: upload of small set of data (in the form of messages) from mobile

devices. The upload is achieved via exposing an upload service. The content of the

message can be the output of an on-board device that generates some metadata and a

picture of a certain situation of the road, e.g. a detected traffic sign.

In the first case, new recordings will be automatically detected by a daemon service that will

generate a Scene Content Description (SCD) file/entry describing the administrative

information of the recording (e.g. calibration file name, date, location, vehicle, etc.). The SCD

file will be stored in the annotation database.

In the second case, the exposed service will automatically handle the incoming information

and add it into the corresponding database.
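For the digital cartography case, the exposed upload service could be exercised as in the following sketch; the endpoint, field names and metadata layout are illustrative, as the deliverable does not define them.

    import json
    import requests

    # Hypothetical upload endpoint for cartography messages.
    UPLOAD_URL = "http://upload-engine.cloud-lsva.local/upload"

    metadata = {
        "detection": "speed_limit_sign",
        "value": 80,                       # km/h, as read by the on-board device
        "timestamp": "2018-01-15T10:32:05Z",
        "position": {"lat": 52.3702, "lon": 4.8952},
    }

    with open("traffic_sign.jpg", "rb") as image:
        resp = requests.post(
            UPLOAD_URL,
            data={"metadata": json.dumps(metadata)},
            files={"picture": ("traffic_sign.jpg", image, "image/jpeg")},
        )
    resp.raise_for_status()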

4.3.7 Tools engine

In its most basic form, the tools engine is simply an application that can execute the

instantiation of containerised applications which may reside in a registry of available tools.

The most straightforward technology to use is Docker, where a tool translates to a container image and the registry is the Docker Registry.

At development stage, the Docker engine will be used to launch container applications,

mainly from the Analytics engine. The Docker Registry will be managed manually to update

the available tools.

4.3.8 Pipeline engine

The pipeline engine is a middleware engine that determines which tools need to be

executed, using what underlying resources, and in which order. As described in section

4.1.3, this engine can be implemented, during the development of the project, in the form of

3rd party SW platforms.

Currently, the pipeline engine is mainly used in the semi-automatic annotation process. The main idea is to improve the annotation efficiency by following the dataflow paradigm: each annotation task is split into several atomic tasks that can be represented as a graph. Executing the graph naturally fits the architecture of the pipeline engine. An example of an annotation pipeline is represented in the figure below. In particular, it shows every software stack involved in the instantiation of the pipeline and how the data flows between each node of the pipeline.

At this stage of the project the following tools are used to implement the annotation pipeline:

Apache Spark as a general data processing engine.

Luigi as a batch job pipeline orchestrator. Such an orchestrator is needed to schedule annotation tasks between automatic workers based on Spark and manual workers that use the manual annotation user interface.

Kubernetes is still considered a good choice for its ability to manage and orchestrate containerised applications (see section 4.1.2).

4.4 Communication and Data Format

4.4.1 Physical specifications

The TomTom services will be deployed as individual “Micro” Services in a lightweight

deployment container (Docker). By using individual service endpoints behind load balancers

we ensure that elasticity is part of the core cloud architecture.

4.4.2 Interface description

The TomTom services operate as isolated services and will use independent tooling. At interface level, the services will comply with the consortium ecosystem and use the common resources and interfaces as much as possible.

In addition, the TomTom services will offer an interface for:

speed sign recognition

an endpoint to download the latest Lane Map update

All interfaces will be exposed as REST interfaces over HTTP(S), using protobuf as the binary encoding format. When high-speed, near real-time communication is needed, a WebSocket-based solution will be offered in parallel based on the same protobuf messages.


Detailed technical interface descriptions will be produced as the project goes into its different phases.

4.4.3 Annotation format

An in-depth discussion about annotation data model, and annotation file formats is provided

in deliverable “D3.1 Import/export interfaces and Annotation data model and storage

specification”. In this section, one of the options for the annotation data model (the Viulib

Video Content Description, VCD, www.viulib.org) is described according to the defined

requirements of section 3.4.

VCD is an annotation model and tool specially devised to describe content of image

sequences (or any other equivalent data sequence, such as point clouds), in the form of

spatial, temporal and spatio-temporal entities, called Elements. An Element in VCD can be

an Object, Event, Action, Context or Relation.

All the annotations in VCD can be grouped as follows:

Object - Contains information that can represent numerical magnitudes, such as

bounding boxes, polygons, points, or any array of numbers that represent arbitrary

information (e.g. steer angle and speed of vehicle at a given instant). Information

inside Object is organized in ObjectDataContainers which contain all the different

ObjectData that describe the Object at a given time point.

Action - Temporal Element which represents a semantic situation, such as an

activity or action of one or more Objects, described with a text string that can be the

URL of an ontology item (e.g. http://www.viulib.org/ontology/\#PersonRunning).

Event - A point in time that triggers some Action or appearance of an Object, and

that can be used to link actions in sequences, e.g. #PedestrianStartsWalking.

Context - Any other additional information of the scene that is not directly related to

Objects or Actions, but that increases the semantic load of the annotations: e.g.

#Raining, #Night, etc.

Relation - Elements can be connected via Relations. This type of annotation follows

the RDF triplets definition, which describes a rdf:subject, a rdf:predicate and an

rdf:object. Any VCD Element can be rdf:subject or rdf:object, depending on the

rdf:predicate. This type of annotation is extremely useful to identify participants of

scenes, their semantic implication, and for easy and fast retrieval of elements of a

scene.

Additionally, VCD contains information about the video under annotation, in the form of

administrative metadata (VideoMetaData), along with information of the source of

annotation, the URL of ontologies used, and the body of annotations.

VCD Structure

VCD structures its internal content hierarchically, using maps for the different Element types,

and referencing all data relative to the corresponding FrameIntervals. The Object type is of

particular relevance, since its information is structured into ObjectDataContainers which can


host different nature information about the Object at each FrameInterval (see Figure 35 and

Figure 36).

Figure 35: Pseudo-UML diagram of VCD structure: Managers are the main orchestration tool of Elements.

Figure 36: Object is a special case as it holds its content as a ObjectDataContainer that can contain heterogeneous ObjectData such as bbox, string, polygon, etc.

The basic orchestration structure implies the utilization of Managers, one per Element type (depicted as Manager&lt;Object&gt;, Manager&lt;Event&gt;, etc. in the figures). These management structures hold two maps: the elementMap (e.g. actionMap for the Manager&lt;Action&gt;) and the activeMap. The elementMap uses the UIDs (Unique IDentifiers) of the Elements as keys, with the Elements stored as values of the map. The activeMap records which UIDs correspond to active Elements at a given frame number. This second map is required to serialize content in Frame-wise mode, as explained in subsequent sections.

ObjectDataContainer

At a given time point, an Object might need to be described using a variety of numerical

descriptors. As opposed to many other description languages, which primarily focus on a

single type (e.g. bounding boxes), the VCD allows to add as many descriptors as desired,

with a variety of types that allow the annotation of any type of information. For that purpose,

an Object contains an ObjectDataContainer, which manages ObjectData items inside.

ObjectData is the abstraction class of numerical entities such as bounding boxes, polygons,

points, circles, and general arrays. ObjectData can be named to add a semantic description

inside the Object (e.g. bounding box "body", bounding box "head").

As expected, the ObjectData type bounding box (represented as bbox in VCD) can be used

to annotate rectangles in images for generating ground truth of objects of interest, such as

vehicles or pedestrians, which is one of the most frequent use cases of ADAS-related

datasets. Polygons can represent arbitrary shapes, so they fit well with pixel-wise labelled

images.

Other ObjectData serve generic purposes. For instance, array can be used to define an

arbitrary number of magnitudes that represent some geometry or physical property of an

element of the scene. This flexible ObjectData is very useful to allocate space for

magnitudes that can be computed or derived from others, e.g. distance to other bounding

boxes, color, depth, etc. The VCD API allows adding, removing, and merging ObjectData

and ObjectDataContainer inside Objects.

Serialization modes

The serialization of VCD data can be executed in two modes: Element-wise and Frame-

wise. In Element-wise mode, all information of a given Element (e.g. Object) is grouped

together spanning its entire data within the sequence.

In Frame-wise mode, information is presented at Frame level, showing for each Frame only

the information of the Elements corresponding to that specific point in time (frame).

These two modes respond to two different uses of annotations. Element-wise mode

produces smaller payloads, in which information is grouped to better suit batch processing,

i.e. for rapid access to a single Element and all its information for the entire sequence.

Frame-wise is devised to support real-time annotation, in which information is produced and

updated sequentially, and thus can be sent as a message to other nodes in a computing

network.

Support for XML and JSON documents has been implemented, as well as the different VCD types, including Element-wise, Frame-wise and a mixed mode. Combinations of document and VCD types are possible (i.e. JSON Element-wise, JSON Frame-wise, XML Element-wise, etc.).

In Element-wise XML and JSON files, content is organized hierarchically from the VCD root down to lists of Objects, Actions, etc., which contain each individual Object, Action, etc. Internally, each Element contains its FrameInterval, name, type, etc.

Note that an ObjectDataContainer can be defined for single frames or for time intervals, which is very useful to compress the annotation if the data (e.g. the bounding box) does not change during a certain period.
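To make the Element-wise layout concrete, the following Python sketch builds a JSON-like structure in the spirit of the description above; it is purely illustrative and does not claim to reproduce the exact VCD schema or field names.

    import json

    # Illustrative Element-wise annotation of one Object whose bbox is constant
    # over a frame interval, so a single ObjectData entry covers the whole span.
    vcd_like = {
        "VCD": {
            "metadata": {"video": "clip_000123.avi", "annotator": "detector_v1"},
            "Objects": [
                {
                    "uid": 0,
                    "name": "pedestrian_0",
                    "type": "Pedestrian",
                    "frameInterval": {"start": 120, "end": 180},
                    "ObjectDataContainer": {
                        "bbox": [{"name": "body",
                                  "frameInterval": {"start": 120, "end": 180},
                                  "val": [412, 230, 96, 64]}]
                    },
                }
            ],
            "Actions": [
                {"uid": 0, "type": "#PersonWalking",
                 "frameInterval": {"start": 120, "end": 180}}
            ],
        }
    }
    print(json.dumps(vcd_like, indent=2))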

In Frame-wise XML and JSON files, content is structured using Frames and Frame nodes.

Each Frame contains all the information of Elements that exists in the corresponding frame

number or time instant.

The Frame-wise mode produces notably larger files, since more overhead is created for

each frame. As a way to reduce the total disk size, a mixed mode has been implemented,

which builds a Frame-wise file but without the inner information of Elements. Instead, only

their UID is used as indicator of which Elements are active at each frame. The actual content

of these Elements is stored in a separate Element-wise file.

The JSON serialization provides a very compact representation of the annotations, which greatly reduces storage space. Also, the JSON format removes some overhead through the JSON array feature, which lists the elements of an array without the need to add a textual tag to each.

Additional reduction of the data footprint can be achieved using serialization frameworks

such as Apache Avro4 or Google Protocol Buffers5. These tools compress the payload by

defining a schema (e.g. a JSON schema) and a binarization mechanism to generate

untagged data messages which can then be sent via sockets or message hubs (e.g. Apache

Kafka6), or stored in binary data files.

Scene Content Description

To this point we have presented the VideoContentDescription concept, as a model to

describe video content. In this section we present the SceneContentDescription (SCD),

which represents a higher description layer than VCD, responding to the need to coordinate

annotations from multiple data sources (e.g. video and point clouds) and the relationships

between annotations (see Figure 37).

4 https://avro.apache.org 5 https://developers.google.com/protocol-buffers/ 6 http://kafka.apache.org


Figure 37: The SCD encapsulates all the information of a recording session, including calibration files, and pointers to the different VCD files produced for the different sensors. Static content (e.g. list of recordings) reflects bibliographical information about the recording itself; dynamic content contains live information from the annotation process (e.g. which VCDs refer to the current SCD, or which is the association between labelled Elements across VCDs).

The SCD can then be understood as the description of the recording session, which has all

the required information to coordinate annotations across sensors. In fact, it can be used to

link to all those VCD files produced that represent annotations, including VCD from

detectors, human annotators, etc.

Calibration and Session information

In order to describe a set-up of multiple cameras, it is necessary to identify the spatial

relationship between each sensor and a universal coordinate system. This information can

be produced during installation and as a result, a calibration file can be generated and

pointed to from the SCD content. The reference coordinate system can be selected as the centre of the rear axle of the car, as specified by ISO 8855.

The information included as Session metadata may include the geolocation of the path

followed by the vehicle during the recording session, along with other information such as

the vehicle used, the date, the owner of the dataset, etc.

Association Manager

Sensorized test cars are equipped with a number of cameras to visualize all the

surroundings of the vehicle. This implies that the same scene is viewed from different


angles, and that elements of the scene need to be annotated for each view. For some

applications (e.g. lane change assist) it is beneficial to have the correspondences between

objects annotated in multiple views. This way, operations on the annotations such as search

or select can be easily mapped from one view to another.

Figure 38: Objects in different views can correspond to the same object in the real world. The SCD allows to identify these relationships and be used to transfer annotations from one view to another, or to enhance queries against annotated content.

The SCD contains an Association file which describes an N-dimensional matrix (one

dimension per camera), which determines which Objects correspond across views via their

unique identifiers (UID). An example simplified association matrix is illustrated in Figure 38.
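A minimal sketch of such an association structure is given below; a list of per-object mappings is used instead of an explicit N-dimensional matrix, and the camera names and UIDs are hypothetical.

    # Hypothetical association between Object UIDs in per-camera VCD files.
    # Each row links the UIDs that represent the same real-world object; None
    # means the object is not annotated in that view.
    associations = [
        {"front_camera": 3, "right_mirror_camera": 7, "rear_camera": None},
        {"front_camera": 5, "right_mirror_camera": None, "rear_camera": 2},
    ]

    def find_in_view(associations, view, uid, target_view):
        """Map an Object UID from one camera view to the corresponding UID in another."""
        for row in associations:
            if row.get(view) == uid:
                return row.get(target_view)
        return None

    # Which UID in the right mirror camera corresponds to UID 3 in the front camera?
    print(find_in_view(associations, "front_camera", 3, "right_mirror_camera"))  # -> 7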

4.5 Scene Recording Module

In both the ADAS and Cartography generation use cases, an in-car data recorder will have to be designed according to the requirements and will have to cope with the vehicle sensor setup.

The requirements in terms of performance and logging bandwidth are very different between the two use cases:

The ADAS use case will require logging of very high-bandwidth sensor data streams, such as several HD video streams from multiple cameras. Such streams are not supposed to be altered by compression, for instance, as they will be used later on for image processing algorithm execution, evaluation, benchmarking, and potentially statistical proof of correct operation in an ISO 26262 certification process. As an overview, the amount of sensor data to record can reasonably be higher than 1 GB/s (of the order of 10 Gb/s), which means around 4 TB/h (see the back-of-envelope calculation after this list).

The Cartography generation use case will not be as demanding; however, it adds the requirement to be able to stream data to the cloud server in near real-time via mobile communication networks. It would require a high-accuracy/high-frequency positioning system, potentially lower frame rate video captures, point-cloud acquisition for 3D information, etc.
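The ADAS figure above can be checked with a quick calculation; the 8-hour recording day used in the last line is an assumption for illustration.

    # Back-of-envelope check of the ADAS logging figures quoted above.
    gb_per_s = 1.0                          # sustained sensor data rate, GB/s
    gbit_per_s = gb_per_s * 8               # = 8 Gb/s, i.e. of the order of 10 Gb/s
    tb_per_hour = gb_per_s * 3600 / 1000    # = 3.6 TB/h, i.e. roughly 4 TB/h
    tb_per_day = tb_per_hour * 8            # ~29 TB for an assumed 8-hour recording day

    print(f"{gbit_per_s:.0f} Gb/s, {tb_per_hour:.1f} TB/h, ~{tb_per_day:.0f} TB per 8 h")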


4.5.1 Recorder capabilities

The recorder will be based on a PC to which all sensors will be connected via various

interfaces (Ethernet, USB 3.0, CAN & FlexRay adapters, etc.). It must have the capability to

associate an accurate timestamp to each and every data sample acquired from the various

sensors. Such timestamps have to refer to the same timebase, most likely the GPS time.

Depending on the bandwidth to log, it is likely that the recording PC will be equipped with racks of SSD disks mounted in RAID-0 (striping) mode. Such racks of disks have to be easily extractable for data transfer (see the next subsection).

Figure 39: Recording PC

A tactile tablet can be connected to the recorder PC for the following purposes:

Specifying meta-information for the driven scenario (vehicle model, vehicle number,

approximate location, start time, driver name, sensors configuration…)

Monitoring the sensors data streams during recording: the driver or passenger can

then easily check whether all the expected sensors streams are correctly acquired

and detect any software, hardware, or connectivity issue.

Providing the passenger with a way to tag manually, via a tactile interface, some of

the situations encountered during driving.

The recorder may also be able to generate situation and event tags (e.g. “Driving on highway / Driving in urban environment / Speed limit = … / Dangerous pedestrian / etc.”) automatically in real time, thanks to a GPS sensor connected to a digital map from which various information can be extracted.

4.5.2 Data compression

For data which will not be transmitted to the cloud over the air, mainly due to bandwidth

limitations, data recordings will have to be extracted from the in-car data recorders, then

transferred to a local upload station before they are physically uploaded to the cloud via

standard fibre-optics networks.

Due to the fact that high volume data transfers to the cloud are long and costly, it is

necessary that the data, and particularly video data, will be compressed (still with lossless

compression) before it is transferred to the cloud.


Lossless compression algorithms are very demanding in terms of CPU load; therefore, it does not seem possible to apply lossless compression to the numerous recorded video streams in real time in the car.

This is why an intermediate upload station will be used to unload the recorded data from the vehicle and compute the lossless compression there, before finally uploading to the cloud.

The data recorder itself has to provide the capability to accurately timestamp all the sensor data.

4.5.3 Physical specification for sensors

ADAS use case

The recorder will be equipped with the following sensors:

Cameras: 4 to 6 sensors; USB 3.0 or GigaEthernet; up to 60 Hz, 1280 x 1024; estimated bandwidth 100 MB/s per camera.

Velodyne laser scanner: 1 sensor; GigaEthernet; 15 Hz, 64 layers; estimated bandwidth 3 MB/s.

High-accuracy INS (GPS + IMU): 1 sensor; Ethernet; 100 Hz; estimated bandwidth 1 MB/s.

FlexRay interface: 1 sensor; PCIe or USB; frame rate: ?; estimated bandwidth: ?.

Figure 39: Data acquisition and upload process (data flows from the data acquisition vehicle to the upload station and then to the cloud).

Cartography generation use case

Sensor type | Number of sensors | Interface | Frame rate / Resolution | Estimated bandwidth
Cameras | 1 | USB 3.0 or GigaEthernet | 30 Hz | 10 Mbps (compression allowed)
High accuracy INS (GPS + IMU) | 1 | Ethernet | 100 Hz | 1 MB/s
Vehicle CAN bus | 1 | PCIe or USB | 1 kHz | 1 Mbps

4.5.4 File formats and chunking

The recorder will have to record the different sensor streams in separate files, with standard formats (as far as possible, and when compatible with the required performance). This is described in more detail in section 3.2.1.

Additionally, for the sake of efficient data exploitation once large amounts of data are stored in the cloud and need to be post-processed, it will be necessary to automatically retrieve sub-sequences in time and sub-sets of the available data streams without having to access the entire dataset.

Therefore, it will be necessary to cut recorded files into chunks (with a size configurable at record time, e.g. 4 GB chunks).


Figure 40: File chunking.

Once data files are split into chunks and the frontier timestamps between chunks are known in the database, it will be possible to dispatch the chunks across the cloud and to access only the necessary chunks when a user needs to read recording sub-sequences in time, or sub-sets of the data streams, for annotation activities or function benchmarking.
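A minimal sketch of the chunking idea follows, assuming the stream is a sequence of timestamped samples. The 4 GB chunk size is the configurable default mentioned above; the returned index (chunk file plus frontier timestamps) is what would be registered in the database so that sub-sequences can later be fetched chunk by chunk.

    from pathlib import Path
    from typing import Dict, Iterable, List, Tuple

    CHUNK_SIZE = 4 * 1024**3  # 4 GB default, configurable at record time

    def write_chunks(samples: Iterable[Tuple[float, bytes]],
                     out_dir: Path, stream_name: str,
                     chunk_size: int = CHUNK_SIZE) -> List[Dict]:
        """Write one sensor stream into fixed-size chunks and return an index of
        chunk files with their frontier timestamps (to be stored in the database)."""
        out_dir.mkdir(parents=True, exist_ok=True)
        index: List[Dict] = []
        chunk_id, written, first_ts, last_ts = 0, 0, None, None
        path = out_dir / f"{stream_name}_{chunk_id:05d}.bin"
        f = open(path, "wb")
        for ts, payload in samples:
            if first_ts is None:
                first_ts = ts
            f.write(payload)
            written += len(payload)
            last_ts = ts
            if written >= chunk_size:
                f.close()
                index.append({"chunk": str(path), "t_start": first_ts, "t_end": last_ts})
                chunk_id += 1
                written, first_ts = 0, None
                path = out_dir / f"{stream_name}_{chunk_id:05d}.bin"
                f = open(path, "wb")
        f.close()
        if first_ts is not None:  # close out the last, partially filled chunk
            index.append({"chunk": str(path), "t_start": first_ts, "t_end": last_ts})
        return index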

4.5.5 Automatic annotations on in-vehicle embedded platforms

The main goal of the Cloud-LSVA project is to develop a cloud-based platform for annotating images. However, given the increasing compute capability available in embedded in-vehicle platforms, it makes sense to take advantage of that capability and perform at least some of the annotation work inside the vehicles, before the data is uploaded to the cloud. The annotations generated in the vehicles would then be transferred to the cloud together with the rest of the data for further processing.

Ideally, the same annotation workloads running in the cloud would also run in the vehicles, and all of the automated annotation work would be done in real time in the vehicle. However, depending on the actual compute capability of the embedded platform, as well as the compute requirements of the actual workloads (expressed, for example, in the required number of floating-point operations per second, i.e. FLOPS), this may not be possible. In that case, some of the recorded frames would be fully annotated in the vehicle, while others would be skipped during in-vehicle processing and annotated later in the cloud. A further goal of Cloud-LSVA is therefore to estimate how much annotation work can be done in the vehicle versus how much needs to be done in the cloud, so that a better judgement can be made in the price-versus-performance analysis of the in-vehicle platforms.

For scene segmentation workloads, which are typically required for adding pixel-level annotations to the images, we estimate the compute cost to be on the order of 1-10 GFLOP for a single pass through a deep neural network (DNN). The actual number depends on the image resolution and the final DNN architecture, but given the state-of-the-art overview of DNNs for semantic segmentation in [6], we consider this a reasonable estimate. To derive the theoretical number of GFLOPS required from the platform, we must take into account the number of images per second, as well as an overhead factor accounting for the fact that 100% of the specified theoretical compute capability of a platform cannot easily be reached in real life. We estimate the required peak performance of the platform to be 400 GFLOPS or more.
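The estimate can be reproduced with the simple calculation below. The per-frame cost, the frame rate and the achievable-efficiency factor used in the example are assumptions consistent with the ranges stated above, not measured values.

    def required_gflops(gflop_per_frame: float, frames_per_second: float,
                        achievable_fraction: float) -> float:
        """Theoretical peak GFLOPS a platform must offer to sustain the workload,
        given that only a fraction of the peak is reachable in practice."""
        return gflop_per_frame * frames_per_second / achievable_fraction

    # Example with assumed values within the ranges discussed in the text:
    # ~4 GFLOP per DNN pass, 30 frames per second, ~30% of peak achievable
    # -> roughly 400 GFLOPS of theoretical peak required.
    print(required_gflops(4.0, 30.0, 0.3))  # ~400.0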


As a provider of in-vehicle compute platforms, Intel proposes the following types of platforms

to be investigated for the Cloud-LSVA project:

• Xeon-based Intel Go automotive platform: here we would take advantage of a server-class CPU built into an embedded platform. The versions available as of this writing (January 2018) are based on dual-socket Xeon CPUs with the Skylake architecture, a total core count of 40+ (possibly 50+) and twice as many compute threads. The theoretical peak performance depends on the actual number of cores and the frequency, but the order of magnitude is 1-2 TFLOPS per socket. The software support for Xeon is very good and includes optimised libraries (such as the Math Kernel Library, or MKL) and support for the major deep learning frameworks (such as Caffe and TensorFlow).

• Atom-based Intel Go platform: this platform is based on Intel Atom cores, which are less powerful than Xeon cores, but it offers the possibility to attach accelerator hardware dedicated to high-performance computing (HPC) workloads. The current versions come with the Intel Arria 10 FPGA (field-programmable gate array), with a theoretical compute power of up to 1.5 TFLOPS. The drawback of using FPGAs is the complexity associated with programming them efficiently for HPC workloads, but support for DNN-type workloads is increasing rapidly due to the popularity of FPGAs.

• Intel IVI (In-Vehicle Infotainment) platforms: these platforms are based on low-power Atom cores, but they typically come with integrated graphics processors, which makes them suitable to some extent for DNN workloads. The theoretical peak performance is upwards of 100 GFLOPS, which may or may not be sufficient for the Cloud-LSVA project, depending on the final workloads used. Future platforms are expected to bring the performance close to 1 TFLOPS, but probably not within the timeframe of the project. The support for DNN workloads is very good, with acceleration libraries dedicated to integrated graphics (e.g. the clDNN library) and support for the major deep learning frameworks.

In conclusion, the Cloud-LSVA project will investigate several possible embedded platform variants for annotating images inside the vehicle and will produce a recommendation for the variant that makes the most sense in terms of the required performance, but also in terms of the associated price.
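As a rough illustration of the performance side of that trade-off, the sketch below screens the platform classes described above against the ~400 GFLOPS estimate, using only the order-of-magnitude peak figures quoted in this section. Prices are deliberately left out, since they depend on the final configuration.

    # Theoretical peak figures quoted above (order-of-magnitude values, in GFLOPS).
    PLATFORM_PEAKS = {
        "Xeon-based Intel Go (per socket)": 1500,      # 1-2 TFLOPS per socket
        "Atom-based Intel Go + Arria 10 FPGA": 1500,   # up to 1.5 TFLOPS
        "Intel IVI (integrated graphics)": 100,        # upwards of 100 GFLOPS
    }

    REQUIRED_PEAK_GFLOPS = 400  # estimate derived in section 4.5.5

    for name, peak in PLATFORM_PEAKS.items():
        verdict = "meets" if peak >= REQUIRED_PEAK_GFLOPS else "falls short of"
        print(f"{name}: ~{peak} GFLOPS peak, {verdict} the ~{REQUIRED_PEAK_GFLOPS} GFLOPS estimate")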


4.6 Middleware and SDKs

4.6.1 RTMaps

RTMaps (Real-Time Multimodal Applications) is a modular environment for the rapid development and execution of real-time multi-stream applications.

RTMaps is a generic tool, but it is particularly well suited for ADAS development, testing and validation, as well as for autonomous vehicle software integration, in particular for applications related to perception, communication, data fusion and decision-making.

As a component-based environment, RTMaps provides numerous off-the-shelf functions for data acquisition from many kinds of sources (cameras, laser scanners, radars, ultrasonic sensors, CAN & LIN bus, XCP, audio, analogue & digital I/Os, GPS, INS, biometric sensors, communication systems, eye trackers…), as well as for processing, 2D & 3D visualisation, and synchronous recording and playback.

An easy-to-use graphical development environment allows swift application setup and

configuration.

Developers can easily integrate their own functions and build their own library of components thanks to a powerful C++ SDK (a Python SDK is currently under development as well). In this way, numerous partners can offer their own technology, already integrated and ready to use as RTMaps components, such as SLAM, lane-marking detection, road detection, localisation, digital map interfaces, obstacle detection, visibility assessment, etc.

RTMaps has been developed with simple ideas in mind:

• Ease of use,

• Ease of programming,

• Outstanding execution performance (intra-process component communication, multithreading, event-based scheduling, copyless data exchange, fixed-memory operation),

• Modularity for capitalisation of developments and cooperation between teams,


• Interoperability with other environments (such as Matlab® & Simulink®, ROS,

Qt/QML, various simulators…)

• Scalability (lightweight, portable and distributable runtime engine)

RTMaps can also be controlled programmatically by third-party software through a control API. It can therefore be used in cars, but it can also take part in automated processing tasks in the cloud.
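RTMaps itself is programmed through its own C++ SDK and graphical studio; the short Python sketch below deliberately does not use the RTMaps API. It only illustrates, with purely hypothetical classes, the component-based dataflow idea described above (sources, processing components, timestamped samples).

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Sample:
        timestamp: float  # common timebase, as in the recorder
        data: object

    class Component:
        """Hypothetical dataflow component: receives samples, emits samples."""
        def __init__(self, process: Callable[[Sample], Sample]):
            self._process = process
            self._outputs: List["Component"] = []

        def connect(self, downstream: "Component") -> "Component":
            self._outputs.append(downstream)
            return downstream

        def push(self, sample: Sample) -> None:
            out = self._process(sample)
            for c in self._outputs:
                c.push(out)

    # Example diagram: camera source -> lane-marking detector -> recorder sink
    camera = Component(lambda s: s)
    lane_detector = Component(lambda s: Sample(s.timestamp, {"lanes": []}))
    recorder = Component(lambda s: s)
    camera.connect(lane_detector).connect(recorder)
    camera.push(Sample(0.0, b"raw-frame-bytes"))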

4.6.2 Computer vision and Machine learning SDKs

The ability of Cloud-LSVA to annotate relies partly on the automatic computation of models from labelled data and on the identification of instances of Objects and Events in video. For that purpose, computer vision and machine learning algorithms will be implemented. The vision-related SDKs to be used are listed below:

Name | Description | License
OpenCV | OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. | BSD
Viulib® | Viulib (Vision and Image Utility Library) is a set of precompiled libraries that simplifies the building of complex computer vision and machine learning solutions, specially focused on sectors such as ADAS. | Proprietary (Vicomtech-IK4)
Caffe | Caffe is a deep learning framework made with expression, speed, and modularity in mind. | BSD 2-Clause
DLib | Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. | Boost Software License
TensorFlow | TensorFlow is an open-source software library (written in Python, C++, and CUDA) for dataflow programming across a range of tasks. It is a symbolic math library, also used for machine learning applications such as neural networks. | Apache 2.0 open source license
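To show how these SDKs could be combined for automatic pre-annotation, the sketch below loads a pre-trained Caffe detection model through OpenCV's dnn module and runs it over the frames of a recording. The model file names, the recording path, the input size and the preprocessing values are placeholders for whatever detector the project finally selects.

    import cv2

    # Placeholder model files; any Caffe detection model readable by OpenCV's dnn
    # module (e.g. an SSD-style detector) could be used here.
    net = cv2.dnn.readNetFromCaffe("detector.prototxt", "detector.caffemodel")

    cap = cv2.VideoCapture("recording_cam_front.mp4")  # placeholder recording
    annotations = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        blob = cv2.dnn.blobFromImage(frame, scalefactor=1.0, size=(300, 300),
                                     mean=(104, 117, 123))
        net.setInput(blob)
        detections = net.forward()  # output shape depends on the chosen network
        annotations.append({"frame": frame_idx, "raw_detections": detections})
        frame_idx += 1
    cap.release()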


4.6.3 Localisation Layers

The ‘real-time map’ is stored on the navigation device and kept up to date by incremental update services from the cloud.

The HD map contains different layers; the most important is the HD Road layer, which contains the highly detailed road geometry that can be used for automated driving. This layer holds the centre lines, the lanes, the lane dividers and other parameters needed for automation. To ensure that the car always has the freshest version of the map, the map is not delivered in the traditional static way but is streamed to the car. This delivery system is called autostream and can currently deliver the Road Layer.

Part of the Cloud-LSVA project is to extend the autostream platform with the delivery of the localisation layers. One of these layers is RoadDNA, which is created from point-cloud information harvested by lidars.

We will focus on extending the platform with the SLAM layer created in phases 2 and 3 of the prototyping. This will close the loop and ensure quick updates of this highly accurate localisation layer. With a car that sources new features, a system that uploads them to the cloud, a fusion engine and a delivery system, the loop is closed.
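The closed loop described above can be summarised in code form. All functions below are hypothetical stubs that only name the four steps (source features, upload, fuse, deliver); they do not represent the actual autostream, RoadDNA or fusion-engine interfaces.

    from typing import Dict, List

    def extract_slam_features(drive_log: List[bytes]) -> List[Dict]:
        # In the vehicle: derive SLAM features from the recorded drive (stub).
        return [{"id": i, "descriptor": frame[:8]} for i, frame in enumerate(drive_log)]

    def upload_to_cloud(features: List[Dict]) -> None:
        # Over the air or via the upload station (stub).
        print(f"uploading {len(features)} features")

    def fuse_into_localisation_layer(features: List[Dict], layer: Dict) -> Dict:
        # Cloud fusion engine merging new observations into the layer (stub).
        layer.setdefault("features", []).extend(features)
        return layer

    def deliver_incremental_update(layer: Dict) -> None:
        # Streamed back to the fleet, autostream-style (stub).
        print(f"streaming layer with {len(layer['features'])} features to the fleet")

    def closed_localisation_loop(drive_log: List[bytes], layer: Dict) -> None:
        features = extract_slam_features(drive_log)
        upload_to_cloud(features)
        updated = fuse_into_localisation_layer(features, layer)
        deliver_incremental_update(updated)

    closed_localisation_loop([b"frame-0-bytes", b"frame-1-bytes"], {})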

Figure 41: Example of SLAM features


Figure 42: RoadDNA example.


5. References

[1] Kooij, J. F. P., Schneider, N., Gavrila, D. M. (2014). Analysis of pedestrian dynamics from a vehicle perspective. In Proc. IEEE Intelligent Vehicles Symposium (IV), June 8-11, 2014, Dearborn, Michigan, USA, pp. 1445-1450.

[2] Ulbrich, S., Menzel, T., Reschka, A., Schuldt, F., Maurer, M. (2015). Defining and Substantiating the Terms Scene, Situation, and Scenario for Automated Driving. In 2015 IEEE 18th International Conference on Intelligent Transportation Systems, pp. 982-988.

[3] Hülsen, M., Zöllner, J. M., Weiss, C. (2011). Traffic intersection situation description ontology for advanced driver assistance. In Proc. IEEE Intelligent Vehicles Symposium (IV), pp. 993-999.

[4] Feld, M., Müller, C. (2011). The automotive ontology. In Proceedings of the 3rd International Conference on Automotive User Interfaces and Interactive Vehicular Applications (AutomotiveUI ’11), New York, p. 79.

[5] Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5-6), 907-928.

[6] Garcia-Garcia, A., Orts-Escolano, S., Oprea, S. O., Villena-Martinez, V., Garcia-Rodriguez, J. (2017). A Review on Deep Learning Techniques Applied to Semantic Segmentation. CoRR, abs/1704.06857.