
Detection of objects on the ocean surface from a UAV with visual and thermal cameras: A machine learning approach

Ola Tranum Arnegaard∗, Frederik S. Leira∗, Håkon H. Helgesen∗, Stephanie Kemna∗∗, Tor A. Johansen∗

∗Center for Autonomous Marine Operations and Systems (AMOS), Department of Engineering Cybernetics, Norwegian University of Science and Technology, Trondheim, Norway.

∗∗Maritime Robotics AS, Trondheim, Norway.

Abstract— Unmanned aerial vehicles (UAVs) can provide great value in off-shore operations that require aerial surveillance, for example by detecting objects on the water surface. For efficient operations by autonomous aerial surveillance, a reliable automatic detection system must be in place: one that limits the number of false negatives, but not at the expense of too many false positives. In this paper, we assess multiple aspects of the detection system that may provide significant impact in off-shore aerial surveillance: first by assessing detection architectures based on convolutional neural networks, then by adding tracking algorithms to utilize temporal information, and finally by investigating the use of different imaging modalities. Through a comparison of several detection models, the experiments show that misclassification of objects is a particular issue, and that input resolution and object size influence the overall model performance. The use of a tracking algorithm allows the confidence threshold to be decreased, which results in fewer false negatives without a significant increase in false positives. In addition, comparing information obtained from visual and thermal imaging systems shows that these modalities provide complementary information in the presence of sunlight reflection.

I. INTRODUCTION

The research in this paper is motivated by offshore operations with (fleets of) vessels that cannot quickly deviate from their operational trajectory or are towing equipment in some formation, for example for seismic operations or Navy fleets. Any obstacles in the path of the vessel may disrupt operations, which may lead to hours of lost operational time and costly damage to equipment. This can be avoided if such obstacles are detected earlier with an aerial surveillance system, for example using UAVs to scan the ocean surface several kilometers ahead of the vessel(s). The task of the UAV is therefore to inform the operator or decision support system about floating objects of interest ahead of a vessel, which are identified through object detection1 in UAV camera imagery [1]. Fig. 1 shows a simple illustration of the application. Compared to using a human observer to detect objects, UAVs are able to perform more accurate detections (both classification and localization of the object) at larger distances from the vessel. Another advantage is that the human observer can do more complex tasks instead.

1 In this paper, the term object detection refers to both localization and classification of objects.

How well the UAV accomplishes this task is reflected by how well it is able to detect objects. This is the center of attention in this paper. The most critical objective of the detection system2 is to avoid false negatives (FNs). This is simply because failing to detect and inform the operator about an object is highly undesirable, particularly in collision-avoidance scenarios since floating objects can cause damage to towed equipment. Furthermore, in order to avoid unnecessary replanning of the vessel’s path or excessive verification of detections by the human operator, a low number of false positives (FPs) is also desired. Intuitively, the ability to reduce FNs determines how much an operator can trust the system to detect any objects in the water. On the other hand, if this leads to high numbers of FPs and a need to constantly verify the detections, the system will require too much operator time. Consequently, there is often a compromise between these abilities. With this in mind, we investigate which aspects of an off-shore aerial detection system may provide a significant impact on these objectives.

A. Related works

Cazzato et al. [3] illustrate in their survey on object detection from UAVs that object detection architectures based on convolutional neural networks (CNNs) may perform well on RGB images in real-time. This is evaluated on, among other metrics, mean average precision, which is a good indicator of the performance with respect to FNs and FPs at various thresholds. The survey examines only detection on RGB images. However, in maritime scenarios, RGB imagery is heavily affected by sunlight reflection (glare) on water surfaces [4]. In these scenarios, using other sensor modalities may be impactful, such as thermal imaging. In [5], Dahlin Rodin et al. apply Gaussian Mixture Modelling to extract maritime objects from thermal images and then classify the objects with a CNN. They show that since there are mainly “two modes present in thermal images at sea: the radiance reflected from the sky, and the heat emitted from the sea”, the separation between foreground and background, i.e. object and ocean surface, is quite significant. This illustrates the importance of input modality in a detection system, and that thermal imaging is suitable in off-shore aerial detection. This is also shown in [6], where Dahlin Rodin et al. illustrate that objects at sea are detectable in both visual and thermal images through an evaluation of edge detection accuracy on a set of objects. Furthermore, Li et al. show in a study on pedestrian detection that, due to the complementary information from the thermal and visual data, a fusion architecture of both modalities performs better than either of the modalities alone [7].

2 In a detection application, one may choose to combine several strategies to improve the performance of the detection algorithm, such as using multiple modalities, processing techniques [2], detection methods, and meta information (spatial, temporal). For simplicity, we refer to this combination of strategies as a detection system.

Another area of interest is the utilization of temporal information in a detection system. Leira et al. [8] consider detection, recognition, and tracking of objects at sea on thermal data from a UAV. By using detections from subsequent time-steps, they keep track of multiple objects in real-time by applying a tracking algorithm which uses a Kalman filter and a constant velocity motion model. Furthermore, Xu et al. [9] illustrate that a simple tracking algorithm applied to the detections from a CNN object detection architecture (YOLOv2) on a UAV provides good real-time performance.

A synthesis of the contributions from these papers suggests a set of aspects in off-shore aerial detection systems that would be valuable to investigate further.

B. Contributions

This paper contributes a three-part examination of aspects of an off-shore aerial detection system, to investigate their impact on detection performance.

Firstly, we assess the performance of deep learning object detection architectures and address the key parameters that affect the performance, through a comparison of various detection models on off-shore aerial RGB images. Methods based on CNNs are applied due to their high performance on RGB data, as mentioned in [3]. This analysis examines the importance of architecture size, input resolution, and classification. The comparisons are performed on a set of open-source, state-of-the-art architectures (i.e. YOLO), which are easily adaptable to many different applications.

Secondly, we investigate how to utilize temporal information on detections. This is examined by an implementation of a real-time tracking algorithm on top of the detections from the object detection model. The aim is not to assess the performance of the tracker, as in [8], [9], [10], but rather to investigate how the tracker may be used to incorporate temporal information into the detection system, to improve the ratio of FNs and FPs by operating as a filter.

Finally, we assess the potential value of employing visual (EO) and long-wave infrared (IR) spectrum data. As shown by [7], a combination of both modalities provides improved performance due to the complementary information in day/night scenarios. Furthermore, previous research suggested that both modalities are applicable in off-shore aerial detection [6]. Consequently, we investigate if they in fact provide significant complementary information in off-shore aerial applications, especially due to issues such as glare. To date, there is little research that presents information on this topic in an intuitive manner, particularly in the domain of object detection. This paper will provide a qualitative comparison of object detection with EO and IR on a set of evaluation sequences, showing the complementary information of the modalities.

Fig. 1. Simple illustration of a maritime vessel’s change of path due to information received from a UAV about a detected object.

In summary, we assess several aspects of a detection system: the method of detection, utilization of temporal information, and the value of operating with two modalities (EO/IR), in order to provide a more extensive analysis of off-shore aerial detection. All methods used in the experiments are available in open-source repositories, and the results are mainly qualitative and agnostic to the method of object detection. Hence, the contributions are generalizable to other detection applications, as suggested in [5], [11], [12].

II. THEORY

A. Object detection

Artificial neural networks are learning systems inspired by the neural structure of the brain. The power of a neural network is its ability to approximate complex functions by changing its parameters. Convolutional neural networks (CNNs) are a class of deep neural networks and are most commonly applied in image analysis. In short, CNNs are based on the utilization of convolution and pooling to extract information from an image. As with other neural networks, the parameters of a CNN are often numerous and trainable, making it suitable for tasks such as image classification, object detection or segmentation, where the mapping from an input image to the desired output is highly complex.

Object detection is the process of localizing one or more objects within an image and classifying those objects using a predefined set of categories. Intuitively, object detection is a two-part process: first localization, then classification. Even though popular two-stage architectures provide high performance, one-stage detectors show similar accuracy at real-time speed [13]. One-stage detectors predict the localization and classifications of the input image in one forward pass rather than in two stages. Current state-of-the-art object detectors are based on deep CNN architectures [14]. It is common for these architectures to include various algorithms to improve detection performance. Consequently, an explanation of any specific architecture is likely to be quite extensive, as it will involve a range of significant additions and tunable parameters. Given that multiple architectures are used in this paper, we only present an overview of the architectures and refer to the references for further details.

The first object detection architecture used in this paper is YOLOv3-SPP. YOLOv3 (2018) [15] uses Darknet-53 as feature extractor (backbone). The largest improvement in YOLOv3 over the earlier versions is that it makes detections on three different scales, and is able to combine the fine-grained information earlier in the network with the more semantic information in subsequent layers. YOLOv3-SPP is inspired by the YOLOv3 architecture but implemented in the machine learning framework PyTorch by Glenn Jocher. As the name states, this version is also modified to include a Spatial Pyramid Pooling (SPP) strategy [16].

Secondly, we test YOLOv4 (2020) [13]. Inspired by CSPNet [17], YOLOv4 uses CSPDarkNet-53 as a feature extractor and applies a PANet-inspired [18] strategy to fuse the features from the different layers in the backbone. A range of other additions and changes is presented in the paper [13]. This research uses the implementation in DarkNet, which is the framework introduced by the original author of the YOLO family, Joseph Redmon.

The final architecture is YOLOv5 (2020) [19], which was introduced by Glenn Jocher. Some controversy has arisen around the fact that, as of April 2021, no YOLOv5 paper has been published. YOLOv5 is an extension of the YOLOv3 PyTorch repository but has acquired some of the improvements introduced in YOLOv4. YOLOv5 includes four architecture sizes: s, m, l, x. An increase in size provides increased accuracy at the cost of inference speed.
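As an illustration of how a model of a given size can be loaded and run, a minimal inference sketch using the publicly available ultralytics/yolov5 torch.hub entry point is shown below. The image path and threshold value are placeholders; the exact training and evaluation setup used in this paper is documented in [22].

```python
import torch

# Load a YOLOv5 model of a chosen size through the public torch.hub entry point.
# "yolov5s" is the smallest variant; "yolov5m", "yolov5l" and "yolov5x" trade speed for accuracy.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Confidence threshold below which detections are discarded (illustrative value).
model.conf = 0.25

# Run inference on one aerial frame; the file name is a placeholder.
results = model('frame_000123.jpg')

# Each row of results.xyxy[0] is [x1, y1, x2, y2, confidence, class_id].
print(results.xyxy[0])
```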

B. Tracking

Tracking is the task of generating time-dependent trajectories for objects in a sequence of images. This paper considers tracking algorithms that depend on a separate method to provide the object detections; hence they are agnostic to the object detection method as long as detections are provided in each frame. We also consider multi-object tracking, which is the process of tracking multiple objects in each image frame. When multiple objects are tracked simultaneously, one must perform the task of associating existing tracks with new detections.

The tracking algorithm ‘Simple Online and Realtime Tracking’ (SORT) is considered a “pragmatic approach to the task of multi-object tracking” [20]. The algorithm is applied on top of existing object detection models and only requires the prediction and bounding box information. Each object is modeled with location, speed, aspect ratio, area and area rate as states. The objects are tracked in the image frame. As mentioned in [20], “when a detection is associated with a target, the bounding box is used to update the target state [and the velocity components are estimated based on the model in the Kalman filtering]. If no detection is associated with the target, its state is simply predicted without correction using the linear velocity model”. The association between existing tracks and new bounding boxes is calculated with the Hungarian algorithm [20], using Intersection over Union (IoU) as cost. SORT has a few controllable parameters. Probation period: the number of times an object must be seen on the track before the track is acknowledged as true; this is important to validate new tracks and avoid false positives. Termination period: if a track is not updated with a new detection within a certain period, it is terminated to avoid updating uncertain tracks. IoUmin: the minimum amount of overlap between the position of existing tracks and new measurements to be considered an overlap at all.
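To make the association step concrete, the sketch below shows IoU-based matching between predicted track boxes and new detections, using SciPy's Hungarian-algorithm routine. This illustrates the principle only and is not the exact SORT code from [20]; the iou_min argument plays the role of IoUmin.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(tracks, detections, iou_min=0.1):
    """Match predicted track boxes to new detections with the Hungarian algorithm.

    Returns (matches, unmatched_tracks, unmatched_detections); pairs whose IoU
    falls below iou_min are treated as unmatched.
    """
    if len(tracks) == 0 or len(detections) == 0:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.zeros((len(tracks), len(detections)))
    for t, trk in enumerate(tracks):
        for d, det in enumerate(detections):
            cost[t, d] = -iou(trk, det)  # negative IoU: lower cost means better match
    rows, cols = linear_sum_assignment(cost)
    matches = [(t, d) for t, d in zip(rows, cols) if -cost[t, d] >= iou_min]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(len(tracks)) if t not in matched_t]
    unmatched_dets = [d for d in range(len(detections)) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```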

III. METHODOLOGY

To perform the analysis in this paper, experimental data were acquired. This section presents a study of the data and the equipment used to collect it. Afterwards, an overview of the three main experiments of this paper will be presented.

A. Equipment

The image data were captured by a FLIR Duo Pro R attached to a rig on a DJI M600 Pro UAV (see Fig. 2), operated by personnel from Maritime Robotics. The FLIR Duo Pro included both a visual and a thermal camera and was mounted at a 45° angle from vertical. It operated with the specifications listed in Table I: the visual camera stored images with three 8-bit channels (RGB) and the thermal camera with one 14-bit channel. Note that the image capture from the two cameras was not synchronized in time. Currently, this is not a feature available on the Duo Pro R camera.

The detection architectures that are used for localizing and classifying objects in the experiments are trained and evaluated with resources from Google Colab: an NVIDIA Tesla P100-16GB GPU and 25 GB RAM. One should note that these resources may not be representative of what may be expected on board a small UAV. They were used to speed up the analysis for the present work.

TABLE I
FLIR DUO PRO R SPECIFICATIONS [21]

      Resolution   Fps       Depth    Channels   Field of View
IR    640×512      9 Hz      14-bit   1          32° × 26°
EO    3840×2160    29.7 Hz   8-bit    3          56° × 45°

B. Data

1) Annotations: The aerial video sequences recorded by the UAV have a combined duration of ca. 1 h 45 min for each modality, and were captured in the Trondheim Fjord. The UAV was flown primarily at 100 to 120 meters above sea level, but is sometimes at heights as low as 50 meters (not considering take-off and landing). The IR frames contain 755 labeled objects in 1001 images, while the EO frames contain 595 objects in 700 images. The disproportionate number of labeled objects between the modalities is discussed further in subsection III-D. The data include five different classes: boat, buoy, pallet, pole (tall buoy) and vest. The distribution of labels among the classes is shown in Table II. Furthermore, the labeled data are divided into three sets: training, validation, and test data. The division is as follows: 70% / 15% / 15%. The object detection architectures are trained on the training data, and the validation data are used to assess the performance during training to stop updating the model at the appropriate time. The final evaluation of the models is performed on the test set. The objects are relatively small compared to the image size, which makes detection a difficult task. Fig. 3 illustrates this by a density plot of the fraction size of the annotations relative to image size. In fact, it shows that most labels are close to 1%-2% of the image width and height.

Fig. 2. Left: Sensor rig including the FLIR Duo Pro R. Right: DJI M600 Pro UAV with sensor rig.
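As an illustration only, a plain random 70% / 15% / 15% split of annotated frame paths could be done as sketched below. As discussed in subsection III-D, near-duplicate frames should ideally not end up on both sides of such a split, so the actual frame selection was done with more care than a purely random shuffle.

```python
import random


def split_frames(frame_paths, seed=0):
    """Shuffle annotated frames and divide them into 70% train, 15% validation, 15% test."""
    paths = list(frame_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(0.70 * len(paths))
    n_val = int(0.15 * len(paths))
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```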

TABLE II
THE DISTRIBUTION OF THE LABELED CLASSES OF THE VISUAL IMAGES

Class    Vest   Pallet   Buoy   Pole   Boat   All
Labels   154    117      250    62     12     595

Fig. 3. The fraction size of the bounding boxes of the labeled data [19] relative to image size. The change in color from dark blue (low) to yellow (high) illustrates the density of objects at the given height and width. Hence, most labels are about 1%-2% of the image width and height. The outliers are likely from take-off and landing as the UAV is closer to the object.

2) Augmentation: Data augmentation is a technique to generate artificial data from the existing dataset. This can be utilized to obtain a more generalized model, i.e. one not too specialized on the training data. This is particularly useful in applications with a limited amount of labeled data, such as in this paper, because the size and variation of the dataset will increase without providing additional images. It is known that data augmentation may provide significant improvements in object detection [13]. In this paper, the following augmentations are applied for training: translation, flip, scale, hue, saturation, exposure, color value, and mosaic. These methods have shown good results in former research [13]. An augmentation method is either always applied to the data (e.g. mosaic), applied with a certain given probability (e.g. flip), or applied to an image with a random amount up to a given limit (e.g. saturation). These techniques are presented more thoroughly in the YOLOv4 paper [13, page 3]. In this project, we use the default augmentation values from the implementations of the architectures. Therefore, except for mosaic, the amount of each augmentation varies to some degree between the YOLOv4 and YOLOv3-SPP/YOLOv5 models. However, the differences are small and not expected to significantly affect the results.
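For illustration, an augmentation configuration in the style of the YOLOv5 hyperparameter files could look roughly like the sketch below. The key names follow the ultralytics conventions, but the values shown are placeholders rather than the exact defaults used in this work.

```python
# Illustrative augmentation settings in the style of a YOLOv5 hyperparameter file.
# Values are placeholders: this paper uses the implementations' defaults, which may differ.
augmentation = {
    "hsv_h": 0.015,    # hue shift, applied with a random magnitude up to this fraction
    "hsv_s": 0.7,      # saturation shift
    "hsv_v": 0.4,      # exposure / value shift
    "translate": 0.1,  # random translation as a fraction of image size
    "scale": 0.5,      # random scale gain
    "fliplr": 0.5,     # probability of a horizontal flip
    "mosaic": 1.0,     # mosaic augmentation applied to every training image
}
```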

C. Experiments

The detection architectures and tracking methods considered in this paper should be able to run in real-time if provided with access to sufficient computational power (see [15], [13], [19], [20]); hence this topic will not be discussed further. All models are trained until convergence on the validation dataset with batch sizes of 8 or 16, depending on the input resolution.

1) Object detection - visual imaging: This experiment analyzes the detection performance of different real-time object detection architectures on aerial EO data captured with a UAV. Important aspects affecting the detection system are the detection architecture size, input resolution and the necessity of classification. The detection is performed by using the architectures presented in Section II-A. By evaluating a set of models, we can indicate key parameters important to the overall performance in the UAV application, which is specified through the performance with respect to FPs and FNs. Multiple real-time architectures are included in order to have a set of variations to compare, and to approximate how comparable detection architectures may perform in similar applications. However, it should be noted that comparing the numerical results from this paper alone to detection models from other papers is not considered useful, as the performance of each model is confined to its training data.

All chosen architectures are from the YOLO family, which has gradually been improved since its original release in 2016. Further in-depth information about this paper’s implementation of the models is available in a GitHub repository [22]. Due to the simplicity of its implementation, YOLOv5 is applied in multiple trials in this paper. This includes both multi-class (MC) and binary (BIN) detection, different architecture sizes and input resolutions.

One should note that the training procedures are not fully deterministic3, hence the trials may not accurately reflect the true capabilities of the architectures.

2) Tracking: In the second experiment, we assess the value of utilizing temporal information in the detection system. This is achieved by including a tracking algorithm. The experiment is linked to the objectives presented in the introduction as follows: a reduction of FNs is obtainable by lowering the confidence threshold in object detection, i.e. how confident the operator wants the model to be before categorizing a detection as an object. Due to the FN-FP dynamic4, this often introduces FPs. To address this issue, we attempt to apply multi-object tracking to the object detection pipeline such that sporadic FPs will be filtered out. This is because most FPs will be detected only for a few frames at most, similar to noise, and will thus not be acknowledged as an object by the tracking algorithm. Thereby the detection pipeline also includes a temporal component. We apply the multi-object tracking algorithm SORT [20] because of its simplicity of implementation as well as its real-time performance. Due to its simplicity, there are some scenarios where it is inferior to other tracking algorithms. For instance, when objects reappear after being outside of the field of view, they may be considered as new objects. SORT is implemented with YOLOv5s; the choice of detection architecture is arbitrary.

The tracking algorithm is used as a tool to include temporal information in the detection system, and an analysis of the tracking performance against other tracking algorithms is therefore not provided. Our current intention is not to find the best tracking algorithm, but to investigate if it can be used to reduce the number of FPs in general. The evaluation is as follows: a detection model is tested on a video sequence with and without tracking at two detection confidence thresholds, 0.75 and 0.25. The performance of the different trials is then qualitatively evaluated. The thresholds are chosen empirically, after a brief evaluation of different confidence thresholds, such that they represent two values at opposite sides of the confidence spectrum.
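A rough sketch of the resulting pipeline is given below: the detector runs at a deliberately low confidence threshold, and only tracks that survive the tracker's probation period are reported. The model loading uses the public ultralytics/yolov5 torch.hub interface, while the tracker object stands in for a SORT-style instance and is assumed rather than defined here.

```python
import torch

CONF_THRESHOLD = 0.25  # deliberately low, to keep false negatives down

# Detector: a YOLOv5 model loaded via the public torch.hub entry point (cf. Section II-A).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)


def run_pipeline(video_frames, tracker):
    """Yield confirmed tracks per frame; `tracker` is assumed to be a SORT-style object
    whose update() takes detections as rows of [x1, y1, x2, y2, confidence]."""
    for frame in video_frames:
        det = model(frame).xyxy[0].cpu().numpy()       # rows: [x1, y1, x2, y2, conf, cls]
        det = det[det[:, 4] >= CONF_THRESHOLD][:, :5]  # keep box and confidence only
        # Only tracks that have passed the probation period are returned, so sporadic
        # false positives (seen in just a few frames) are filtered out at this step.
        yield tracker.update(det)
```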

3) Visual & long-wave infrared imaging: The final experiment investigates complementary information in two modalities: we assess if utilizing a combination of EO and IR provides improved performance, particularly with regard to FNs. This is assessed by evaluating two identical object detectors, one per modality, both based on YOLOv5s. The choice of architecture is arbitrary. The evaluation is a qualitative comparison on a set of four video sequences. Comparing how well each modality performs individually in terms of object detection metrics is not considered important at this stage. If they perform similarly on all data, using both modalities may cause unnecessary overhead in a UAV application. However, if they show complementary abilities, one might consider a detection strategy utilizing both. Only binary classes are considered in these trials (object/no-object). To reduce FPs, tracking is applied as well.

3 The architectures include parts that are based on randomness (e.g. parameters, augmentation).

4 As an example, to reduce the number of FPs, one might want to increase the confidence threshold. However, this will rule out some of the low-confidence TPs that now become FNs. To make sure all true objects are detected again, the threshold may be lowered. However, this results in an increase in FPs.

D. Error sources

This section discusses sources of error that may influence the results in this paper. To gather data from video sequences, one may choose to select every x-th frame of a video, or a frame from every sufficiently different scenario. However, an issue arises when similar scenes appear for a set of frames, or the change of scenery is marginal. With a random division of data, training and test frames may have negligible differences, resulting in a model trained and tested on somewhat identical data. As a consequence, the test results will not accurately reflect the true capabilities of the detection model. In order to avoid that, in this paper, an attempt is made to label frames with visible variations in the scenes. However, instances of similar training and test data may still occur.

Another issue is that, due to human annotation, only frames with objects visible to the human eye are annotated. For instance, in the presence of glare, if an object is not detected by the annotator and it is not possible to estimate where the object is, the object is not annotated. If this is a frequent issue, the detection models will be unable to detect these objects due to a lack of labels. Nonetheless, this is not considered a significant issue in this research.

The IR video has a lower frame rate than the EO video. One might assume that the Kalman filter in SORT may operate better with less time between each frame. However, the tracking is purely used to compare the modalities, and slight improvements in performance due to this will not be a major issue in the analysis.

One should also note that the training data are extracted from the same flight sequences as the evaluation videos, both in the tracking and EO/IR experiments. This should not significantly affect the qualitative conclusions, as it will at worst increase the detection performance on the videos. Secondly, considering the length of the full dataset, the probability that a large amount of training data are extracted from the time span of the evaluation sequences is not significant.

Finally, the data for this study were captured within the span of a couple of hours. Therefore, the state of the scenery does not vary much, e.g. sea condition, weather, and illumination. In addition, the objects are represented by at most a few instances in the data. This may result in poorer detection performance in scenarios other than those in the data of this project, i.e. the models are sensitive to the scenery.

IV. RESULTS & DISCUSSION

This section presents the results from the experiments, and assesses if the investigated aspects significantly impact the performance of an off-shore aerial detection system, particularly regarding the objectives presented in the introduction.

A. Object detection - visual imaging

The real-time detection architectures are evaluated on average precision (AP.5), i.e. the area under the precision-recall curve. This is a common metric in object detection, which will give an indication of the performance on FNs and FPs at various thresholds. Also, due to the uneven distribution of labeled objects, a weighted sum of the AP.5 of the classes is added (All). The distribution of objects in the test set is as follows: 3 boats, 36 buoys, 20 pallets, 8 poles and 22 vests. The test set consists of 108 images. Moreover, the confusion matrices are calculated at a confidence threshold of 0.25. This is chosen to be low enough to capture most true objects while avoiding too many FPs. Tables III and IV show the performance of the architectures on EO data from the UAV. All architectures are trained on multiple classes and an input resolution of 640x640 pixels. However, YOLOv5s is also used in binary detection and with an input resolution of 2144x2144 pixels.
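As a reminder of how this metric is formed, a simplified sketch of computing average precision from ranked detections is given below. Real evaluators additionally interpolate the precision-recall curve; the sketch is for illustration only.

```python
import numpy as np


def average_precision(confidences, is_true_positive, num_ground_truth):
    """Simplified AP: area under the precision-recall curve built from ranked detections.

    confidences: score of each detection; is_true_positive: whether it matched a
    ground-truth box (e.g. at IoU >= 0.5 for AP.5); num_ground_truth: labeled objects.
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_ground_truth, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Step-wise integration of precision over recall (recall starts at 0).
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```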

TABLE III
AP.5 PERFORMANCE OF OBJECT DETECTION ARCHITECTURES ON VISUAL DATA.

              All    Boat   Buoy   Pallet   Pole   Vest
YOLOv3-SPP    0.84   0.45   0.74   0.97     1.00   0.90
YOLOv5s       0.80   0.52   0.62   0.99     1.00   0.87
YOLOv5l       0.84   0.83   0.72   0.92     1.00   0.90
YOLOv4        0.91   1.00   0.90   0.93     0.88   0.91

TABLE IV
AP.5 PERFORMANCE OF VARIATIONS OF YOLOV5S ON VISUAL DATA.

              All    Boat   Buoy   Pallet   Pole   Vest
YOLOv5s       0.80   0.52   0.62   0.99     1.00   0.87
YOLOv5s-2k5   0.87   1.00   0.71   1.00     0.98   0.96
YOLOv5s-BIN6  0.95   –      –      –        –      –

Primarily, the results in Table III illustrate that all models perform quite well on the AP.5 metric, even the smaller architecture YOLOv5s. Among all models in Table III, YOLOv4 performs best on a weighted sum evaluation (All). YOLOv4 is similar in size to YOLOv5l, which performs equally well or better than the final two architectures: YOLOv3-SPP and YOLOv5s. Moreover, Table IV shows that YOLOv5s is greatly improved by increasing the input resolution (YOLOv5s-2k). However, it performs even better when it only considers binary classification, i.e. anomaly detection.

5 Input resolution of 2144 pixels as opposed to 640 pixels (see IV).

6 Binary classes (object/no-object) as opposed to multiple classes (MC) (see IV). Therefore, there are no results for each class.

We consider the classification performance with YOLOv5, where we tested different image resolutions for input. Considering the actual detections made by the model, i.e. not FNs and TNs (true negatives), Table V illustrates7 that misclassification is an issue for YOLOv5s. For instance, roughly a third of all detected buoys are classified incorrectly. However, Table VI shows that by increasing the input resolution (YOLOv5s-2k), misclassification is no longer an issue. This makes sense intuitively, because a higher resolution includes details that may reveal differentiating properties between objects. Most misclassifications occur for the smaller objects, buoy and vest, that have fewer distinctive features. Considering the task of the UAV in this paper, one can assume that misclassifications are not critical as long as the object is detected. However, in other scenarios, differentiating between critical and less important objects may be crucial, for instance in search and rescue, where the distinction between a buoy and a human (vest/life jacket) matters. Consequently, higher resolution may be of interest, such that subtle defining features are noticed. It could be valuable to do a similar classification analysis for the other detection models from Table III, though similar results are expected for different image input sizes.

By looking at the information provided by the confusion matrices and Table IV, it is likely that the increased performance of YOLOv5s-2k is due to its ability to correctly classify the objects. However, one should note that there is no direct link between how well the model classifies objects and its AP.5. Confusion matrices are calculated at a given confidence threshold, but an AP.5 metric is calculated for all thresholds. However, due to the low number of FPs and FNs in both evaluations of the confusion matrices, one may assume that the ability to classify well is the most probable reason for the impressive performance of YOLOv5s-2k.

TABLE V
CONFUSION MATRIX FOR YOLOV5S7

                                True
Prediction    Boat   Buoy   Pallet   Pole   Vest
Boat          1.00   0      0        0      0
Buoy          0      0.64   0        0      0.10
Pallet        0      0      0.94     0      0
Pole          0      0      0        1.00   0
Vest          0      0.36   0.06     0      0.90

TABLE VI
CONFUSION MATRIX FOR YOLOV5S-2K7

                                True
Prediction    Boat   Buoy   Pallet   Pole   Vest
Boat          1.00   0      0        0      0
Buoy          0      1.00   0        0      0
Pallet        0      0      1.00     0      0
Pole          0      0      0        1.00   0
Vest          0      0      0        0      1.00

An alternative to avoid misclassifications is to use a binary (BIN) detector. The results show that the YOLOv5s-BIN model has the best performance. As YOLOv5s-2k has no issues in classification at threshold 0.25, one might assume it should perform better, or at least equally well, in AP.5, due to its higher resolution. However, this is not the case, and it can be explained by thresholds below 0.25: previous FNs might now be detected by YOLOv5s-2k but misclassified. YOLOv5s-BIN will not have this issue for any threshold because the model makes no distinction between classes and hence does not suffer from misclassification errors.

In this analysis, we have shown that a range of models perform quite well on aerial detection of objects in off-shore EO data. The results show that the input resolution can have significant impact on the model’s ability to classify objects, and that binary classification may be most beneficial, if the application allows discarding object classes.

B. Tracking

The evaluation of the tracking is performed as follows: First, the YOLOv5s-BIN model is evaluated on a representative video sequence of the dataset with a moderate object detection threshold of 0.75. Afterwards, the same procedure is executed with a threshold of 0.25. The hypothesis is that the number of FNs decreases (↑ recall) but the number of FPs increases (↓ precision). With the aim of reducing the FPs, SORT is then applied with the latter threshold value. Finally, the performances on the videos are qualitatively compared. The precision-recall curve in Fig. 4 shows the sensitivity of YOLOv5s-BIN at different confidence threshold levels. It shows that to obtain a highly reliable model, i.e. few FNs, one must roughly allow a precision as low as 0.2, where only 1 out of 5 detections is correct. This is likely to correspond to a low confidence threshold, such as 0.25.

The SORT parameters are chosen from a brief qualitative evaluation on a validation video sequence. In addition, to prevent the IoU values from becoming too low due to the small size of the objects, all bounding boxes have their width and height increased by 20 pixels before applying tracking. The chosen values are as follows (a configuration sketch is given after the list):

• IoUmin = 0.1, because of the already small bounding box sizes.

• Probation period = 10 detections, as the chance of ten FPs in a row is quite low.

• Termination period = 10; a value to avoid too many reappearances of objects in new tracks (ID switches).
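Under the assumption that a SORT implementation following [20] is used, where these quantities correspond to the constructor arguments min_hits, max_age and iou_threshold, the chosen values and the bounding-box padding could be expressed roughly as follows:

```python
PAD = 20  # pixels added to the width and height of each box before tracking


def pad_box(x1, y1, x2, y2, pad=PAD):
    """Grow a detection box symmetrically by `pad` pixels in width and height,
    so that IoU between small, nearby boxes does not collapse to zero."""
    return x1 - pad / 2, y1 - pad / 2, x2 + pad / 2, y2 + pad / 2


sort_params = {
    "min_hits": 10,        # probation period: detections required before a track is acknowledged
    "max_age": 10,         # termination period: frames without an update before a track is dropped
    "iou_threshold": 0.1,  # IoUmin: minimum overlap for a detection to update an existing track
}
```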

Videos of each trial are available in [23]. The evaluation sequence lasts for 46 seconds and includes 9 objects8. For the first two trials, the number on each detection represents the confidence. In the final video, the first number is the tracker ID and the second represents the confidence. A summary of the trials is shown in Table VII, and a set of snapshots from the videos are presented in Fig. 5.

7 A confusion matrix is not an exhaustive performance metric in this application due to the nature of object detection. However, the sole purpose of these figures is to give an illustration of the impact that classification has on the overall performance. A consideration of the matrices as a good reflection of the true performance is justified by the non-significant number of FPs and FNs in the test results (108 images).

Fig. 4. Precision-recall curve for YOLOv5s-BIN.

TABLE VII
KEY NUMBERS FROM THE TRACKING TRIAL EVALUATION (∼46 s).

                             Undetected Objects   False Positives   ID Switches9
Threshold 0.75               1 of 9               0                 –
Threshold 0.25               0 of 9               ∼50               –
Threshold 0.25 + Tracking    0 of 9               0                 8

Fig. 5, Table VII and the videos show that with a high confidence threshold it is difficult to detect objects in all frames. More importantly, they show that 1 out of the 9 objects is not detected at all: it is not detected even once during its overall presence in the video (∼5 seconds). However, this is not an issue at a lower threshold of 0.25. Yet, this lower threshold introduces significant amounts of FPs, on average more than 1 per second. By introducing the tracking layer after detection, wrong detections (FPs) are filtered out without affecting the ability to detect true objects. It preserves high recall (low FN), yet increases precision significantly (low FP). The detection system does lose objects from time to time, but mostly manages to detect them again before the track termination period ends10.

When introduced to quick movements, the simple Kalman filter used in SORT is not able to follow the object, and new tracker instances are created, such as for the pole (∼27 s). In fact, 7 out of the 8 false ID assignments in Table VII were due to the pole’s non-smooth and rapid motions. Nevertheless, such rapid motions are uncommon in most applications. The experiment shows that the inclusion of temporal information through a tracker provides significant improvements in an off-shore aerial detection system, both in FNs and FPs. This is especially important for FN-critical systems, such as the one presented in this paper.

8 Conditions: (1) if one considers an object that goes out of frame as a new instance, (2) if one regards the mid-right object (at 0.25 threshold) as a bird, as birds are not considered objects of interest, (3) if one considers the objects that briefly go in and out of the frame at the end as one object.

9 Objects that are lost by their track and assigned (switched) to a new ID.

10 A tracker object is the same if it has the same ID number.

(a) A single snapshot with a threshold of 0.75. Only one of the two objects is detected, not only during t = 1.00−2.00 s, but during their overall presence in the video (∼5 seconds).

(b) Multiple snapshots (5) with a detection threshold of 0.25 combined into one image. 7 distinct detections are made during t = 1.00−2.00 s. Each detection is assigned the number of the frame it stems from. Red: FP. Green: TP. Despite having 5 FPs, both true objects are detected.

(c) A single snapshot with a threshold of 0.25 + tracking. Both objects are detected during t = 1.00−2.00 s without any FP.

Fig. 5. Snapshots from a scenario from the evaluation sequence to illustrate differences between the trials. All TPs and FPs are marked.

C. Visual & thermal infrared imaging

The aim of this section is to assess the complementary information from EO and IR by evaluating models of both modalities on a few shorter videos. The results from the models are shown side by side, more or less synchronized in time. The EO model in this experiment is the same model as in the previous assessment, YOLOv5s-BIN. The IR model is also trained as a BIN detector, and achieves an impressive AP.5 of 0.96. The results from the sequences are available in [23] and illustrated in Figure 6, which includes three snapshots from the videos. For the interested reader, a full flight comparison (∼18 minutes) of how the models perform is also available in [23]. This is one of several flights from this project (all flights combined are roughly 1 h 45 min).

The full flight comparison and the EO results from the previous assessments illustrate that the visual models show high overall performance, particularly with high resolution. However, the videos and Fig. 6 illustrate that even though the EO camera covers a larger view and can provide higher resolution than IR, the EO model performs poorly under certain conditions. On the other hand, under the same conditions, the IR model performs generally well. One may confidently assume that this is due to the improved information provided by the IR modality in these scenarios. A visual inspection shows that IR has significantly fewer problems with glare. Consequently, this example qualitatively shows that two modalities, which on their own are proven to perform well in off-shore aerial detection, provide complementary information. Furthermore, by utilizing both modalities (e.g. through fusion), off-shore aerial detection systems are likely to detect objects that otherwise would not be detected with either of the modalities alone. This is primarily due to glare, but situations with low illumination (e.g. night) are also likely to show similar results.

V. CONCLUSION

This paper investigated several aspects of a detection system: the method of detection, utilization of temporal information, and the value added by operating with two modalities (EO/IR). All methods used are available in open-source repositories, and the results are mainly qualitative and agnostic to the object detection method, such that the contributions are generalizable to other detection applications. In addition, video sequences are provided to substantiate the conclusions and convey the results in an intuitive manner.

Through a three-part examination, we have assessed aspects that may have a significant impact on the performance of off-shore aerial detection. The experiments show, through a comparison of several detection models, that misclassification of objects is a particular issue, and that input resolution and object size influence the overall detection performance. The experiments also show that the consideration of temporal information may provide significant improvements in an off-shore aerial detection system. This is demonstrated by the implementation of a simple tracking algorithm, which acts as a filter of false positives. This enables a decrease in confidence threshold, which results in fewer FNs, without a significant increase in FPs. Finally, a study of the information provided by EO and IR imaging shows that the modalities provide complementary information in off-shore scenarios, particularly due to the presence of sunlight reflection but likely also in other situations, e.g. low illumination. This suggests that off-shore aerial applications may benefit from utilizing both modalities. Exploring further integration, e.g. through a fusion strategy, will be a natural next step, best achieved with a camera setup where EO and IR imagery can be accurately timestamped and/or synchronized in time.

ACKNOWLEDGMENT

This work was partly funded by the Research Council of Norway, Project 296690, MAROFF-2, and the Research Council of Norway, Equinor and DNV through the Centre for Autonomous Marine Operations and Systems (NTNU-AMOS), project number 223254.

REFERENCES

[1] T. A. Johansen and T. Perez, "Unmanned Aerial Surveillance System for Hazard Collision Avoidance in Autonomous Shipping," in International Conference on Unmanned Aircraft Systems, Washington, DC, 2016.

[2] "What is MSX?", June 24, 2019. Accessed: Feb. 7, 2021. [Online]. Available: https://www.flir.eu/discover/professional-tools/what-is-msx

[3] D. Cazzato, C. Cimarelli, J. Sanchez-Lopez, H. Voos, and M. A. Leo, "Survey of Computer Vision Methods for 2D Object Detection from Unmanned Aerial Vehicles," Journal of Imaging, vol. 6, no. 8, p. 78, 2020.

[4] H. Kwon, J. Yoder, S. Baek, S. Gruber, and D. Pack, "Maximizing target detection under sunlight reflection on water surfaces with an autonomous unmanned aerial vehicle," in International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, 2013.

[5] C. Dahlin Rodin, L. N. de Lima, F. A. A. Andrade, D. B. Haddad, T. A. Johansen, and R. Storvold, "Object Classification in Thermal Images using Convolutional Neural Networks for Search and Rescue Missions with Unmanned Aerial Systems," in International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, 2018.

[6] C. Dahlin Rodin and T. A. Johansen, "Detectability of Objects at the Sea Surface in Visible Light and Thermal Camera Images," in MTS/IEEE OCEANS, Kobe, Japan, 2018.

[7] C. Li, D. Song, R. Tong, and M. Tang, "Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection," Pattern Recognition, vol. 85, pp. 161-171, 2019.

[8] F. S. Leira, H. H. Helgesen, T. A. Johansen, and T. I. Fossen, "Object detection, recognition, and tracking from UAVs using a thermal camera," Journal of Field Robotics, vol. 38, no. 2, pp. 242-267, 2021.

[9] S. Xu, A. Savvaris, S. He, H. Shin, and A. Tsourdos, "Real-time Implementation of YOLO+JPDA for Small Scale UAV Multiple Object Tracking," in International Conference on Unmanned Aircraft Systems (ICUAS), Dallas, TX, 2018.

[10] H. H. Helgesen, F. S. Leira, and T. A. Johansen, "Colored-Noise Tracking of Floating Objects using UAVs with Thermal Cameras," in International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, 2019.

[11] K. Kanistras, G. Martins, M. J. Rutherford, and K. P. Valavanis, "A Survey of Unmanned Aerial Vehicles (UAVs) for Traffic Monitoring," in International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, 2013.

(a) Snapshot from Sequence 4A.

(b) Two consecutive snapshots from Sequence 4B.

Fig. 6. Snapshots from the evaluation sequences illustrating the complementary information of visual and thermal imaging in an off-shore aerial scenario.

[12] A. Tunnicliffe, "Eye in the sky – the role of drones in oil spill management," Offshore Technology, Dec. 2, 2019. Accessed: Feb. 9, 2021. [Online]. Available: https://www.offshore-technology.com/features/eye-in-the-sky-the-role-of-drones-in-oil-spill-management/

[13] A. Bochkovskiy, C. Wang, and H. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," 2020, arXiv:2004.10934. [Online]. Available: https://arxiv.org/abs/2004.10934

[14] Z. Zou, Z. Shi, Y. Guo, and J. Ye, "Object Detection in 20 Years: A Survey," 2019, arXiv:1905.05055. [Online]. Available: https://arxiv.org/abs/1905.05055

[15] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767

[16] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," Computer Vision – ECCV, pp. 346-361, 2014.

[17] C. Wang, H. M. Liao, I. Yeh, Y. Wu, P. Chen, and J. Hsieh, "CSPNet: A New Backbone that can Enhance Learning Capability of CNN," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1571-1580, 2020.

[18] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018.

[19] G. Jocher, "ultralytics/yolov5," GitHub repository, YOLOv5. Accessed: Dec. 1, 2020. [Online]. Available: https://doi.org/10.5281/zenodo.4154370

[20] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple Online and Realtime Tracking," in IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, 2016.

[21] "FLIR Duo Pro R User Guide Version 1.0," Dec. 2017. Accessed: Nov. 1, 2020. [Online]. Available: https://www.flir.com/globalassets/imported-assets/document/duo-pro-r-user-guide-v1.0.pdf

[22] O. T. Arnegaard, "Aerial Object Deteciton EO-IR v1.0," GitHub repository v1.0. Accessed: Feb. 9, 2021. [Online]. Available: https://github.com/olatar/Aerial Object Deteciton EO-IR

[23] O. T. Arnegaard, "Evaluation videos: Tracking + EO/IR," YouTube. Accessed: Feb. 27, 2021. [Online]. Available: https://youtube.com/playlist?list=PLjE3SYK7kAIAzR4DGy-xumeXSZeZ5w7Ms