
Detecting, Localizing, and Recognizing Trees with a Monocular MAV: Towards Preventing Deforestation

Utsav Shah, Rishabh Khawad, and K Madhava Krishna

Abstract— We propose a novel pipeline for detecting, localizing, and recognizing trees with a quadcopter equipped with a monocular camera. The quadcopter flies over an area of semi-dense plantation filled with many trees more than 5 meters in height. Trees are detected on a per-frame basis using state-of-the-art Convolutional Neural Networks, inspired by recent rapid advancements showcased in the Deep Learning literature. Once detected, the trees are tagged with GPS coordinates through our global localization and positioning framework. Further, the localized trees are segmented, characterized by feature descriptors, and stored in a database by their GPS coordinates. In a subsequent run in the same area, the detected trees are queried against the database and associated with the trees stored in it. The association problem is posed as a dynamic programming problem and the optimal association is inferred. The algorithm has been verified in various tree-covered zones of varying density on our campus using a Bebop 2 drone equipped with omnidirectional vision. A high percentage of successful recognition and association of trees between two or more runs is the cornerstone of this effort. The proposed method is also able to identify whether trees are missing from their expected GPS-tagged locations, thereby making it possible to immediately alert the concerned authorities about possible unlawful felling of trees. We also propose a novel way of obtaining a dense disparity map for a quadcopter with a monocular camera.

I. INTRODUCTION

Trees are often found amidst agricultural farmland as a means of retaining subsoil moisture, improving ground water levels, and preventing soil erosion during times of heavy rains. Indeed, agroforestry [15], [7], where trees and shrubs are grown amongst crops as a means of sustainable agriculture, is becoming widely popular. Experts have also argued for tree crops, wherein trees are grown as crops on sloped terrains where normal farming can prove untenable [14]. Nonetheless, in many developing nations trees are myopically viewed as a source of immediate income, felled, and sold at the marketplace as firewood. In the case of more profitable and expensive trees such as sandalwood, there exist organized networks of people involved in the rapid destruction and smuggling of such species.

In this paper we propose a novel framework capable of detecting, localizing, and recognizing trees. The monocular quadcopter flies over a plantation area, teleoperated by the host, gathering images along its route. The quadcopter performs a translational maneuver, gathering images that mimic a stereo pair. Through such a translational maneuver, a dense disparity and depth map of the scene is generated. Each such pair of images is further annotated with its GPS location, the bearing with respect to a global reference, and the feature descriptors of the segmented trees in that pair of images. The trees themselves are detected by a state-of-the-art Convolutional Neural Network [12] adapted to the current context. The segmented trees are assigned GPS tags, or locations, based on the generated disparity map and the ego GPS coordinates of the quadcopter, and are thus localized with respect to a global reference frame. These are then stored in the database with their GPS locations and feature descriptors.

The authors are with the Robotics Research Centre, KCIS, IIIT Hyderabad, India. {shah.utsav, rishabh.khawad}@research.iiit.ac.in, [email protected]

Fig. 1: Representative Output of Our System. Top: Trees Geotagged¹ by Our System on Map. Bottom: Dense Disparity Map Generation with Monocular Quadcopter.

In the query run, the quadcopter is made to move in the same plantation area, though not necessarily along the same route. The quadcopter can enter and exit the site from different locations compared with the database run. A similar process of tagging the trees with GPS coordinates ensues. Using the GPS locations and feature descriptors of segmented trees obtained across the two runs, putative matches are initialized. The optimal association algorithm between the query and database runs makes use of these putative matching scores and initializes an Interpretation Tree [2] in a Joint Compatibility Branch and Bound [10] framework.

¹ In this work, all references to Geotags and GPS Coordinates are in the Geographic Coordinate System of Latitude, Longitude, and Elevation.


Fig. 2: Framework. Column 1: Image Pair Obtained by Horizontal Translation of Quadcopter. Column 2: Top: Output of Convolutional Neural Network; Bottom: Dense Disparity Map using Image Pair. Column 3: Distance and Bearings between Quadcopter and Trees. Column 4: GPS Coordinates Assigned to Trees.

The solution to the association problem is then obtained by dynamic programming over the Interpretation Tree. The result and the pipeline are depicted in Figure 1 and Figure 2.

To the best of our knowledge, there has been no prior art on localizing trees with globally verifiable positions and further recognizing the presence of such trees in the future through a combination of state-of-the-art vision, machine learning, and data association methods. That this pipeline proposes an initial solution to a very relevant problem in the domains of agriculture and environmental conservation is the quintessential contribution of this effort. Apart from this, we provide creative solutions to the problems of finding the depth to trees and of data association. Since monocular reconstruction and depth estimation are often erroneous, the drone is maneuvered to perform a purely translational motion to mimic a stereo pair. The dense disparity obtained from such a maneuver often provides more accurate depth and mapping of trees than routine motion with a monocular camera.

II. RELATED WORK

The task of detecting trees has been approached using various techniques, most notably using airborne laser scanning and lidar [6]. Lidar has also been used with MAVs for similar purposes [18]. However, sensors like laser scanners and stereo cameras are power hungry, which makes them less suitable when the area to be covered is large. Laser scanners are also heavy, making them unsuitable for many publicly available MAVs. Moreover, laser scanning algorithms depend on the density and clustering of trees [17]. In our work, we use a single color image obtained from a monocular camera to detect multiple trees using Convolutional Neural Networks.

Estimating the 3D structure of the scene in front of MAVs is a fascinating and vibrant field. The bouquet of methods includes range sensors, stereo cameras, monocular cameras, Structure from Motion, and other feature-based techniques. Using active range sensors [4] is not an attractive option for our task, for the reasons mentioned earlier. A monocular SLAM based approach [9] would need to observe the scene repeatedly, over far more than two views, in order to obtain accurate depth. We also observe that VSLAM breaks quite often in the complex natural environments we work in. Dense methods such as DTAM [11], [1] are GPU intensive; moreover, DTAM is susceptible to breakage due to insufficient feature tracks if the MAV pitches or rolls significantly. In contrast, our method can provide dense disparity from just two views by creatively moving the monocular camera such that it mimics a stereo rig [16].

Data association in the context of SLAM has been studied extensively [8], [3]. Developing a map matching algorithm using Delaunay features [5] is not ideal for our work, since there are cases when only one or two trees are in front of, or detected in, the scene. Moreover, the detected trees could be collinear, or more than three trees may be detected. We propose a dynamic programming based approach that exploits geometric and appearance similarities along with previous trajectory information for tree matching and association.

Object detection has been a rich domain in computer vision. Convolutional Neural Networks have dominated the field of detection in the recent past [13], [12]. However, there is an absence of networks that can detect trees reliably. In this work, we train a convolutional neural network inspired by [12] on our dataset of trees, which can detect multiple trees in a given image.

III. SYSTEM: OVERVIEW AND ARCHITECTURE

Our system can be decomposed and explained functionally and behaviorally. Functionally, it is composed of modules, each achieving a particular goal, as we describe in the next section. Behaviorally, our system can be thought of as executing two different types of runs for each location. We define the first run as the Database Run and the second run (or any run after the first) as the Query Run. Each run comprises steps, with the modules performing some task in each step.

Figure 2 depicts our system pipeline. We begin each step by performing a horizontal translation (moving from right to left) and obtaining an image pair at the ends of the translation. This pair is fed to SPS-Stereo [19], which outputs a dense disparity map with respect to the left image. Alongside, the left image is also passed to our Convolutional Neural Network, which is designed to detect trees. As column 3 of our framework shows, the distance between the quadcopter and each tree, along with the bearing of the tree from North with the quadcopter as center, is calculated using the depth information and geometry. Using this information along with the current coordinates of the quadcopter, we tag each detected tree with its respective GPS coordinates, as depicted in the last column of the figure.

Fig. 3: Results of our Convolutional Neural Network trained to detect trees.

Once the trees are tagged, the next task depends on the run. In the database run, we simply store the trees with their coordinates and feature descriptors. Associating a tree with itself in the same run is a trivial task, as it boils down to finding the closest tree in the map of the current run. Problems arise when association must be performed across different runs. We explain our solution to data association in detail in Section IV-D. We also manually remove some detected trees in the query run and show the efficiency of our system at detecting such anomalies.
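For illustration, a per-tree database entry might look like the following minimal sketch. The names (TreeRecord, tree_id, descriptors, closest_tree) are ours, not from the paper; the point is that same-run association reduces to a nearest-geotag lookup over the stored records.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TreeRecord:
    tree_id: int
    latitude: float           # geotag assigned during the database run
    longitude: float
    descriptors: np.ndarray   # SIFT descriptors of the segmented tree region

def closest_tree(records, lat, lon):
    """Same-run association: return the stored tree nearest to (lat, lon)."""
    return min(records, key=lambda r: (r.latitude - lat) ** 2 + (r.longitude - lon) ** 2)
```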

IV. SYSTEM: MODULE DETAILS

Our system consists of four modules: 1) Tree Detection, 2) Depth Estimation, 3) GPS Tag Calculations, and 4) Matching and Association.

A. Tree Detection Module

We use a deep convolutional neural network for detecting multiple trees in a given image. Our network is inspired by [12]. The network uses features from the entire image to predict each bounding box. It divides the input image into an S × S grid; here we choose S to be 7. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. The network has 24 convolutional layers followed by 2 fully connected layers. The final output of our network is an S × S × 31 tensor of predictions, where the third dimension is computed as (5 predictions for each bounding box) × (2 bounding box predictions for every cell) + (total classes). The tensor contains the probabilities of 21 classes in each grid cell. We output a detection for each cell whose probability exceeds a threshold. (The details of the architecture can be found in [12].)
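A minimal sketch of how such an S × S × 31 prediction tensor could be decoded into tree detections is given below. The per-cell layout (box entries first, then class scores) and the index of the added tree class are our assumptions for illustration; the exact decoding is defined in [12].

```python
import numpy as np

S, B, C = 7, 2, 21        # grid size, boxes per cell, classes (20 VOC classes + tree)
TREE_CLASS = 20           # index of the added "tree" class (assumed, appended after VOC)

def decode_tree_detections(pred, conf_thresh=0.2):
    """Decode an (S, S, B*5 + C) prediction tensor into candidate tree boxes.

    Assumes each cell stores B blocks of (x, y, w, h, objectness) followed by
    C class scores; the threshold value is illustrative only.
    """
    boxes = []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, obj = cell[b * 5:(b + 1) * 5]
                score = obj * class_probs[TREE_CLASS]   # class-specific confidence
                if score > conf_thresh:
                    boxes.append((x, y, w, h, float(score)))
    return boxes
```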

However, our network differs in one major way: we train it on Pascal VOC 2007, 2010, and 2012 together with our own dataset of trees. We created a dataset of more than 1000 images containing more than 4000 trees. We annotated the dataset manually and trained the network to detect the original 20 classes of the Pascal VOC datasets along with trees (hence 21 classes in total). We initialize the weights of the network with a model pretrained on ImageNet and finetune it for our detection purposes. We start with a learning rate of 0.001 and an SGD learning policy with batch size 64 and steps at 1000, 2000, 5000, 10000, and 16000 iterations. We set momentum to 0.9 and the decay rate to 0.0005. We train for 25000 iterations. Leaky ReLU is used as the nonlinear activation.

Figure 3 shows the results of our detection module. The results show that we are able to detect multiple trees in different lighting conditions and at different quadcopter heights. One important point is that it is possible to detect all the trees in every image by setting the threshold very low. However, that results in a large number of false positives, and hence we instead work with the nearby trees, which the network is able to detect with high confidence. This is acceptable in our work since the quadcopter will eventually fly around all the trees multiple times across different runs.

B. Depth Estimation

In this module, we propose a novel way of calculating a dense disparity map using a quadcopter equipped with a monocular camera. The quadcopter captures an image at its current position, then performs a horizontal translation, and again obtains an image at the end of the translation. This pair of images is fed to a stereo correspondence algorithm; we use SPS-Stereo [19] in our work. It outputs a dense disparity map calculated from the image pair. Figure 4 shows the output of our method.

In this work, we only require the distance between the quadcopter and the detected trees. This distance is calculated using the relationship between disparity and depth as follows:

Z = f ∗ b/d (1)

Here, f is the focal length in pixels, b is the baseline in meters, d is the disparity of the point in pixels, and Z is the perpendicular distance between the camera plane and the plane containing the tree. The baseline b is obtained using odometry data from the quadcopter's IMU. The final distance and bearing between the quadcopter and the tree are calculated using simple geometry and trigonometry.
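The paper only states that the final distance and bearing follow from simple geometry and trigonometry; the sketch below is one plausible reading under a pinhole model. Here f_px and cx_px are assumed to come from prior camera calibration, and u_px is the image column of the detected tree's bounding-box centre (all names are ours).

```python
import numpy as np

def tree_distance_and_bearing(disparity_px, u_px, f_px, cx_px, baseline_m):
    """Recover range and bearing of a tree from its disparity (Eq. 1 plus trigonometry)."""
    Z = f_px * baseline_m / disparity_px        # perpendicular distance to the tree plane (m)
    X = (u_px - cx_px) * Z / f_px               # lateral offset of the tree centre (m)
    distance = float(np.hypot(X, Z))            # straight-line range to the tree (m)
    bearing = float(np.degrees(np.arctan2(X, Z)))  # angle from the viewing direction (deg)
    return distance, bearing
```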

C. GPS Tag Calculations

This module geotags all detected trees. We have obtained the distance and bearing of the trees from the depth estimation module, and we know the current GPS position of the quadcopter. Moreover, as we have just performed a horizontal translation, we can calculate the bearing of the quadcopter with respect to North. We use the haversine formula to calculate the baseline from the GPS fixes as follows:

a = sin²(Δφ/2) + cos φ₁ · cos φ₂ · sin²(Δλ/2)

c = 2 · atan2(√a, √(1 − a))

d = R · c        (2)


Fig. 4: Results Showcasing Dense Disparity Generation. The quadcopter performs a horizontal translation and captures a pair of images at the ends of the translation. This pair is used to calculate the dense disparity map. Normalized for better visualization.

Here, a is the square of half the chord length between the points, c is the angular distance, and d is the final distance between the two points. φ is latitude, λ is longitude, and R is the Earth's mean radius. We compare this baseline with the baseline obtained from the odometry data. If the discrepancy between them exceeds a threshold, we discard the current observations and begin again with the detection and depth estimation modules. Such a discrepancy usually points towards low confidence in the GPS position.
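For concreteness, a minimal Python sketch of Eq. 2 and the baseline consistency check follows; the 0.5 m tolerance is an illustrative value, not taken from the paper.

```python
import math

EARTH_RADIUS_M = 6371000.0   # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes (Eq. 2)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_M * c

def baseline_consistent(gps_start, gps_end, odom_baseline_m, tol_m=0.5):
    """Compare the GPS-derived baseline with the IMU-odometry baseline;
    reject the current observation if they disagree beyond the tolerance."""
    gps_baseline = haversine_m(*gps_start, *gps_end)
    return abs(gps_baseline - odom_baseline_m) <= tol_m
```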

The bearing between the quadcopter and true North is calculated using

θ = atan2(sin Δλ · cos φ₂, cos φ₁ · sin φ₂ − sin φ₁ · cos φ₂ · cos Δλ)        (3)

However, this bearing is calculated considering only the quadcopter's movement and not its viewing angle. Since we are performing a horizontal translation, we add 90 degrees to the calculated bearing, which gives us the bearing between the viewing direction of the quadcopter and North. We then add the bearing of each individual tree to the bearing of the quadcopter, which gives the final bearing of each tree from North with respect to the quadcopter's current viewing direction.
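A small sketch of Eq. 3 plus the 90-degree viewing-direction offset described above. Function names are ours, and the sign of the offset depends on the translation direction and camera mounting (here a right-to-left translation with a forward-facing camera is assumed).

```python
import math

def movement_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing of the translation, clockwise from North (Eq. 3)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0

def viewing_bearing_deg(lat1, lon1, lat2, lon2):
    """Viewing direction: 90 degrees offset from the horizontal translation."""
    return (movement_bearing_deg(lat1, lon1, lat2, lon2) + 90.0) % 360.0
```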

We now have the absolute bearing of the tree's direction from the quadcopter with respect to North, the GPS coordinates of the quadcopter, and the distance between the quadcopter and the tree. Using this information, we calculate the GPS tag for each tree using the following formula:

φ₂ = asin(sin φ₁ · cos δ + cos φ₁ · sin δ · cos θ)

λ₂ = λ₁ + atan2(sin θ · sin δ · cos φ₁, cos δ − sin φ₁ · sin φ₂)        (4)

Here, θ is the bearing (clockwise from North) and δ is the angular distance (d/R). Hence, this module tags all the detected trees on the map with global coordinates.
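A compact sketch of Eq. 4, projecting a tree's geotag from the quadcopter's GPS fix, the absolute bearing to the tree, and the estimated range (a spherical-Earth approximation, as in the haversine step).

```python
import math

EARTH_RADIUS_M = 6371000.0

def tree_geotag(lat1_deg, lon1_deg, bearing_deg, distance_m):
    """Destination point from a start fix, a bearing (clockwise from North), and a range."""
    delta = distance_m / EARTH_RADIUS_M              # angular distance d/R
    theta = math.radians(bearing_deg)
    phi1, lmb1 = math.radians(lat1_deg), math.radians(lon1_deg)
    phi2 = math.asin(math.sin(phi1) * math.cos(delta) +
                     math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lmb2 = lmb1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(phi1),
                             math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lmb2)
```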

D. Matching and Association

The matching and association module is very important for both runs and has a varying degree of complexity across them. In the first run, we achieve the association fairly easily by considering the geometric proximity of observations. The task of association is considerably more difficult in the second run. Since the initial GPS position fix can differ from that of the database run, the whole map created in the query run could be off by some measure. Hence we cannot directly use neighbourhoods between trees in different runs and associate the closest one, as we did in the first run. We solve the association in the second run using the relative geometry between trees and the appearance similarity of the trees, as we describe next.

Figure 5 shows a typical scenario of the query run. The detected trees are shown as green nodes according to their GPS coordinates (left). We want to assign the detected trees to the corresponding trees in the map generated in the database run (right). We define the neighbourhood as the area covered within some radius of a given reference, calculated using GPS information. In the map, red nodes fall outside the neighbourhood and hence are not used for further calculations. Among the non-red nodes, blue nodes are those that lie in the neighbourhood but are not associated with detected trees, while the green nodes on both sides are the ones supposed to match. These non-red nodes are used to find the appropriate association between them and the detected trees, as sketched below.
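The neighbourhood test itself can be as simple as the following sketch, reusing the hypothetical haversine_m and TreeRecord helpers from the earlier sketches; the radius is a tuning parameter not specified in the paper.

```python
def neighbourhood(db_trees, reference_lat, reference_lon, radius_m):
    """Keep only database trees whose stored geotag lies within radius_m of the
    current GPS reference (the non-red nodes of Fig. 5)."""
    # haversine_m and the TreeRecord fields are defined in the earlier sketches.
    return [t for t in db_trees
            if haversine_m(reference_lat, reference_lon, t.latitude, t.longitude) <= radius_m]
```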

At the core of the association process lies the invariance of the geometry. Though the maps created by two runs could be slightly off by some margin in some direction, the relative geometry between trees stays intact. The relative geometry is calculated in terms of the bearings and distances among trees. Along with the relative geometry, we also rely on appearance similarity, which we measure through the uniqueness of SIFT descriptor matches.

The data association problem is solved by a modification of the Joint Compatibility Branch and Bound algorithm [10]. JCBB generates various association hypotheses and searches for the largest set that is jointly compatible. The search itself is done incrementally over an interpretation tree I (Figure 6). The interpretation tree effectively prunes away a number of invalid associations very early in the hypothesis generation, without having to traverse all branches of the tree for the optimal solution.

A node in I is represented as n_ij = {(a_i, b_j), P(S, v_ij)}, consisting of two elements: v_ij = (a_i, b_j), symbolizing the vertex of the node, and P(S, v_ij), the unique path from S to v_ij. Each a_i, b_j is a vector of GPS location and bearing, with a_i denoting the current observation and b_j the observation from the database run.

Fig. 5: Left: Trees Observed in Current Image (Green Nodes), Position of Quadcopter (Triangle). Right: Map Generated of GPS Coordinates of Trees in Database Run. Non-red nodes are in the defined neighbourhood and hence are utilised further for association.

Fig. 6: Interpretation Tree I.

The path element P(S, v_ij) is simply a unique enumeration of the vertices from the source S to the considered node n_ij,P along that path. Note that while the vertices v_ij at a level i are not unique and appear more than once, the node n_ij,P associated with the vertex v_ij becomes unique due to the path element P. The figure shows two such nodes associated with the vertex a₂b₂, circled in green and red, with their corresponding paths shown in green and red as well. Define C_ij,P as the set of combinations of all vertices in the path element of the node n_ij,P, taken two at a time. For example, the circled blue node would have as its C_ij,P the elements {e₁₂, e₂₃, e₃₁}, where each element

e_ij = ‖ (a_i − a_j)ᵀ(a_i − a_j) − (b_i − b_j)ᵀ(b_i − b_j) ‖.

We then define the potential associated with n_ij,P as

ψ_ij,P = ψ_v(v_ij) · ψ_e(P_ij),

where P_ij is shorthand for P(S, v_ij). ψ_v is the potential obtained by matching the feature description of the tree a_i in the current scan with the tree b_j in the database, while

ψ_e(P_ij) = Σ_{e_ij ∈ C_ij,P} e_ij

measures the geometric consistency of the current and database observations along the path P. The optimal association is thus the optimal path from S to the leaf nodes, obtained by solving the standard Bellman equation in an incremental dynamic programming framework optimizing over the node potentials. Since I is constructed incrementally, many computations and path searches are successfully pruned. Further, in all our experiments the tree grows to at most 4 levels, keeping the combination set associated with each node well within computational limits.
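As a concrete illustration of these potentials, the following minimal Python sketch computes e_ij and the node potential ψ_ij,P for a single interpretation-tree path. The data layout (a hypothesis as a tuple (a_vec, b_vec, i, j), vectors expressed in a common metric frame, and an appearance[i][j] SIFT match score) is assumed for illustration; the incremental Bellman recursion over the full tree is not reproduced here.

```python
from itertools import combinations
import numpy as np

def geometric_term(a_i, a_j, b_i, b_j):
    """e_ij: difference between the squared separation of two current observations
    (a_i, a_j) and that of their hypothesised database counterparts (b_i, b_j)."""
    da = np.asarray(a_i) - np.asarray(a_j)
    db = np.asarray(b_i) - np.asarray(b_j)
    return abs(float(da @ da) - float(db @ db))

def node_potential(path, appearance):
    """psi_{ij,P} = psi_v(v_ij) * psi_e(P_ij) for the node closing `path`.

    path: hypotheses (a_vec, b_vec, i, j) from the source to this node, the last
    entry being the node's own vertex; appearance[i][j] is an assumed SIFT-based
    matching score between observation i and database tree j.
    """
    _, _, i, j = path[-1]
    psi_v = appearance[i][j]                                  # appearance term for this vertex
    psi_e = sum(geometric_term(p[0], q[0], p[1], q[1])        # pairwise geometric terms
                for p, q in combinations(path, 2))
    return psi_v * psi_e
```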

V. EXPERIMENTS

We have evaluated our framework on a low-cost commercial Bebop 2 quadcopter by Parrot. It is equipped with a frontal monocular camera, an ultrasound altimeter, a Ublox Neo M8N GPS module, and an onboard IMU. It transmits frames at 640 x 368 resolution at 30 fps to the host device through a WiFi connection. The maximum number of visible satellites is 19. We use the Robot Operating System (ROS) as middleware for communication with the quadcopter. Because of the limited computational power onboard, we perform all calculations on the host system.

We conducted experiments at three locations on our campus. Table 1 and Table 2 provide statistics of our experiments for the database run and the query run, respectively. In the database run, the false positives are the cases in which tree-like structures (such as poles) are detected as trees. The false negatives are the scenarios where the CNN fails to detect a tree even once across all the scenes containing it, typically in the case of extremely bent shapes or unusual colors. The mean error in Table 1 represents the average difference, in meters, between the actual distance between trees and the distance calculated using the assigned GPS coordinates.

In the query run, we use the map generated from the database run and perform data association. False positives and false negatives in the context of the query run imply wrong associations. They typically occur when there are errors in the received GPS data. Our system can cope with some false positives in the query run given multiple observations of the trees (entry 1 of Table 2).

One remarkable capability of our system is detecting missing trees. We artificially remove some trees from the neural network's detections to simulate the absence of trees. The last two rows of Table 2 represent these experiments.


Fig. 7: Results of Data Association for a Set of Trees. We present results for a small set for better visualization of some important details. The two columns on the left show the association between some trees at Location 1 in different runs; similarly, the right column is for Location 2. The marginal shifts in the tagged coordinates between two runs can be seen in both cases. It is also evident that the relative geometry between trees stays intact.

Once the association run is over and none of the query trees have been matched with the missing trees from the database run, we can raise flags with the GPS coordinates of the missing trees.
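A minimal sketch of this flagging step, reusing the hypothetical TreeRecord fields from the earlier sketch:

```python
def flag_missing_trees(neighbourhood_trees, matched_db_ids):
    """Report the geotags of database trees that lay inside the visited
    neighbourhoods yet were never associated with any detection in the query run."""
    return [(t.latitude, t.longitude)
            for t in neighbourhood_trees if t.tree_id not in matched_db_ids]
```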

Figure 7 depicts the association between a small set of trees (for better visualization) in different runs. As we mentioned earlier, the map created in different runs could be translated in some direction depending on the initial GPS location fix. However, our system uses the robust features of relative geometry between trees and is able to associate with high accuracy.

VI. CONCLUSION

This paper showcased a novel pipeline for detecting, localizing, and recognizing trees, making use of state-of-the-art machine learning methods along with a novel way to obtain a dense disparity map using a monocular quadcopter. In particular, the trees are detected and segmented by leveraging the recent explosive growth in the area of Convolutional and Deep Neural Networks. The segmented trees are localized with GPS tags and recognized in subsequent query runs through an optimal data association algorithm implemented as a solution to a dynamic programming problem. The method reveals high-fidelity detection and recognition over 3 datasets gathered by a monocular quadcopter. Most interestingly, the algorithm is able to detect that trees are missing, and thus serves as a potential solution for automated deforestation detection. To the best of our knowledge, the problem of localizing and recognizing trees with a monocular quadcopter has neither been posed before nor has a solution been presented.

ACKNOWLEDGEMENT

This work was supported by grants received from DeitY, National Program on Perception Engineering, Phase-II.

REFERENCES

[1] H. Alvarez, L. M. Paz, J. Sturm, and D. Cremers. Collision avoidance for quadrotors with a monocular camera. In Experimental Robotics, pages 195–209. Springer, 2016.

[2] Tim Bailey. Mobile robot localisation and mapping in extensive outdoor environments. PhD thesis, 2002.

[3] Tim Bailey and Hugh Durrant-Whyte. Simultaneous localization and mapping (SLAM): Part II. IEEE Robotics & Automation Magazine, 13(3):108–117, 2006.

[4] Adam Bry, Abraham Bachrach, and Nicholas Roy. State estimation for aggressive flight in GPS-denied environments using onboard sensing. In ICRA, 2012.

[5] Alexander Cunningham, Kai M. Wurm, Wolfram Burgard, and Frank Dellaert. Fully distributed scalable smoothing and mapping with robust multi-robot data association. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1093–1100. IEEE, 2012.

[6] Harri Kaartinen, Juha Hyyppä, et al. An international comparison of individual tree detection and extraction using airborne laser scanning. Remote Sensing, 4(4):950–974, 2012.

[7] Orestis Kairis, Christos Karavitis, et al. Exploring the impact of overgrazing on soil erosion and land degradation in a dry Mediterranean agro-forest landscape (Crete, Greece). Arid Land Research and Management, 29(3):360–374, 2015.

[8] Michael Montemerlo and Sebastian Thrun. Simultaneous localization and mapping with unknown data association using FastSLAM. In ICRA, 2003.

[9] Raúl Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

[10] José Neira and Juan D. Tardós. Data association in stochastic mapping using the joint compatibility test. IEEE Transactions on Robotics and Automation, 17(6):890–897, 2001.

[11] Richard A. Newcombe, Steven J. Lovegrove, and Andrew J. Davison. DTAM: Dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pages 2320–2327. IEEE, 2011.

[12] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.

[13] Shaoqing Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[14] J. R. Rosell and R. Sanz. A review of methods and applications of the geometric characterization of tree crops in agricultural activities. Computers and Electronics in Agriculture, 81:124–141, 2012.

[15] L. Salvati, C. Kosmas, et al. Unveiling soil degradation and desertification risk in the Mediterranean basin: a data mining analysis of the relationships between biophysical and socioeconomic factors in agro-forest landscapes. Journal of Environmental Planning and Management, 58(10):1789–1803, 2015.

[16] Utsav Shah, Rishabh Khawad, and K. Madhava Krishna. DeepFly: towards complete autonomous navigation of MAVs with monocular camera. In Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, page 59. ACM, 2016.

[17] Jari Vauhkonen, Liviu Ene, et al. Comparative testing of single-tree detection algorithms under different types of forest. Forestry, page cpr051, 2011.

[18] Luke Wallace, Arko Lucieer, and Christopher S. Watson. Evaluating tree detection and segmentation routines on very high resolution UAV lidar data. IEEE Transactions on Geoscience and Remote Sensing, 52(12):7619–7628, 2014.

[19] Koichiro Yamaguchi, David McAllester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pages 756–771. Springer, 2014.