
Performance Evaluation of Feature Descriptors for Aerial Imagery Mosaicking

Rumana Aktar∗, Hadi Aliakbarpour∗, Filiz Bunyak∗, Guna Seetharaman†, Kannappan Palaniappan∗

∗Computational Imaging and VisAnalysis (CIVA) Lab
Department of Computer Science, University of Missouri-Columbia, MO 65211 USA
Emails: [email protected], {aliakbarpour, bunyak, palaniappank}@missouri.edu

†US Naval Research Laboratory, Washington D.C., USA
Email: [email protected]

Abstract—Mosaicking enables an efficient summary of the geospatial content in an aerial video, with applications in surveillance, activity detection, tracking, etc. Scene clutter, the presence of distractors, parallax, illumination artifacts such as shadows and glare, and other complexities of aerial imaging such as large camera motion make the registration process challenging. Robust feature detection and description are needed to overcome these challenges before registration. This study investigates the computational complexity versus performance of selected feature descriptors, namely Structure Tensor with NCC (ST+NCC), SURF, and ASIFT, within our Video Mosaicking and Summarization (VMZ) framework on benchmark VIRAT aerial video. ST+NCC and SURF are very fast but fail on some complex imagery (with occlusion) from VIRAT. ASIFT is more robust than ST+NCC or SURF, though extremely time consuming. We also propose an Adaptive Descriptor (combining ST+NCC and ASIFT) that is 9x faster than ASIFT with comparable robustness.

1. Introduction

Mosaicking is a fundamental task in computer vision and has been studied extensively in remote sensing for decades [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. A mosaic, or panorama, is composed from a set of spatially related images by aligning them to a common frame coordinate system. Comprehensive summaries of mosaicking are given in the review studies [1], [2]. Applications of mosaicking span a variety of fields, such as wide field-of-view generation, video compression, medical imaging, surveillance of military fields, monitoring of urban areas or agricultural fields, and inspection of restricted areas after catastrophic events such as earthquakes, tsunamis, or damage to nuclear power plants.

Feature-based image registration techniques match sparse keypoints provided by feature descriptors in order to find the transformation relating two images given some area of overlap. The choice of descriptor plays a vital role in determining the robustness and efficiency of mosaicking algorithms. One of the earliest mosaicking algorithms is AutoStitch, built on robust SIFT [14] features [13]. AutoStitch is invariant to image ordering, orientation, scale

Figure 1. Video Mosaicking and Summarization (VMZ) Pipeline.

and illumination, and is insensitive to image noise. Tao et al. [5] applied the SIFT descriptor and developed a graph-based approach for scene stitching in large-scale aerial data. They exploited the property that frames overlap in time and space, and stitched the images by temporal grouping, retrieved spatial cross-groups, and finally generated a single mosaic from each cross-group. Goncalves et al. [4] also used the robust SIFT [14] feature detector, together with information from image segmentation followed by a robust outlier removal method, to register images. Many other studies have reported using SIFT for image stitching [15], [16], [17], [18]. SURF features are reported to be as robust as SIFT and more efficient in terms of time complexity [30]. Consequently, many mosaicking algorithms have relied on SURF features for image matching [19], [20], [21], [22].

Hoang et al. [10] proposed a registration algorithm with Harris corner and KLT features for detecting and tracking moving objects in UAV videos. Once the frames are aligned by registration, forward/backward frame differencing and morphological operations are performed to estimate blobs. Finally, using spatio-temporal blob detection,

978-1-5386-9306-3/18/$31.00 ©2018 IEEE


Figure 2. VMZ Decision Engine. There are 4 main sub-modules for controlling VMZ operations: Resource Management, Feature Management, Reference/KeyFrame Hypervisor, and Homography Estimation and Registration.

object tracking is accomplished. David et al. [23] propose a system for mosaicking multicamera images in which the Harris corner is chosen as the feature detector, owing to its high repeatability under radiometric and geometric changes and the abundance of corner points in urban scene images. To match key features in overlapping images, minimum-spanning-tree-based coherence is implemented. Many other image matching and mosaicking algorithms rely on KLT features [24], [25]. To achieve robustness on complex datasets, ASIFT has proved to be an invaluable feature descriptor [12], [26], [27]. Structure Tensor with Normalized Cross-correlation has also been used for fast registration and mosaicking [11], [28].

In this study, we look for an efficient and robust feature descriptor particularly suited to the benchmark VIRAT aerial imagery. Mosaicking VIRAT is difficult because of its many challenges. For example, a video sequence covers a long spatial region that can span a few to ten square miles [29]. A wide range of objects, such as long roads with moving cars, trucks, and other vehicles, as well as pedestrians, parked cars, buildings, wide vegetation, and shadows of tall buildings, can appear anywhere, adding considerable complexity to the dataset. Furthermore, low-contrast aerial videos are captured over a wide field of view of 160 degrees or more at low frame rates of a few hertz [29]. Occlusion due to the presence of large buildings, and dramatic changes of camera position and scene, add a further twist to the aerial dataset [29].

Thus, we take advantage of fast and time-efficient descriptors such as ST+NCC or SURF when there is small motion between two images and no occluded parts. The more robust ASIFT descriptor is applied when matching becomes difficult and the comparatively cheap descriptors perform poorly. This adaptive choice of feature descriptor provides a balance between efficiency and performance for aerial imagery. We explicitly compare extracted feature quality among the ST+NCC, SURF, ASIFT, and adaptive ST+NCC, ASIFT descriptors. In addition, the mosaics generated from the different descriptors also establish the efficiency of our

Figure 3. Image Transformation Model in VMZ. Row A depicts graph-cut segmentation of a video sequence into shots, which are separated by red shot boundaries. Row B shows the sequence of frames in a single shot, which should generate one mini-mosaic. A shot is divided into m groups: Group1, Group2, ..., Groupm (each color represents a different group, row C) to avoid error accumulation in the process. The first frame of each group is called the reference (dark colored frame, row D), while the other frames in the group are shown in the corresponding light color (row D). A single arrow from each frame to the reference in a group denotes direct mapping. From each group we get a set of frames transformed into the reference coordinate system (represented by the same color: dark blue for Group1 in row E). Next, we apply reference-to-reference frame mapping (row E). This results in a single mini-mosaic in the coordinate system of the base/first frame (row F).

adaptive descriptor.

2. Method

2.1. VMZ Algorithm

Video Mosaicking and Summarization (VMZ) is a feature-based registration algorithm. Its block diagram is shown in Figure 1. The main steps of VMZ are described below:

• Preprocessing: this includes identifying and removing bad frames, deinterlacing for better image quality, and image enhancement such as histogram matching and dehazing to improve the visibility of aerial imagery. The input of the preprocessing step is a single image, while the output is a modified image or the removal of that image from the sequence.

• Shot detection based on graph-cut: this is the temporal segmentation of a long video sequence into a few continuous video fragments. Based on cumulative histogram differences of consecutive frames, graph-cut finds a few clusters of images by minimizing the energy in Eq. 1, where
P: finite set of sites/pixels (frames in our case)
L: finite set of labels
l: labelling, i.e., assignment of labels in L to pixels in P
p, q: individual pixels
l_p: label of pixel p
l: set of all label-pixel assignments, l = {l_p | p ∈ P}
D(p, l_p): cost of assigning label l_p to pixel p
V_pq(l_p, l_q): pairwise cost of assigning labels l_p and l_q to neighboring pixels p and q

$$E(l) = \sum_{p \in P} D(p, l_p) + \sum_{p}\sum_{q} V_{pq}(l_p, l_q) \qquad (1)$$

The input to the shot detection algorithm is a whole video, and either 2 or 3 clusters are generated by this step. In the 2-cluster case, cluster0 stands for within-shot (very small change between two consecutive frames) and cluster1 stands for shot boundary (slow or abrupt change). In the 3-cluster case, cluster0 stands for within-shot, cluster1 for slow change, and cluster2 for abrupt or sudden change. A continuous run of images from cluster0 forms a shot (row A in Figure 3); a sketch of the histogram-difference signal that drives this clustering is given after this list. For example, VIRAT dataset 09152008flight2tape1_6 yields 11 shots from 9291 frames. Later, one mini-mosaic is generated from each shot.

• Mini-mosaic generation: the core step of VMZ is mini-mosaic generation, shown in detail in Figure 3, rows B-F. The input of this step is a single shot and the output is a mini-mosaic. First, a shot is divided into a few groups (row C). The first frame of each group is called the reference, and subsequent frames are added to the group until the number of frames exceeds 40 or the motion between the reference and the current frame is greater than a threshold. If either of these two conditions is met, a new group is formed with a new reference frame. Within a group, every frame is mapped directly to the reference (row D). Next, consecutive reference frames are mapped to each other (row E), and finally the mini-mosaic is produced (row F) in the coordinate system of the first (also called the base) frame of the shot.

• Meta-mosaic generation: the final step of VMZ is building the meta-mosaic, which takes all mini-mosaics as input and is intended to generate one meta-mosaic fusing the content of the mini-mosaics. This step is still under development.
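To make the shot-detection input concrete, the following minimal Python sketch computes a cumulative histogram difference between consecutive frames, the signal on which the graph-cut clustering above operates. This is an illustrative assumption, not the paper's code: the function name, the use of OpenCV, and the L1 distance between cumulative grayscale histograms are our choices.

```python
import cv2
import numpy as np

def histogram_difference_signal(video_path, bins=64):
    """Per-frame histogram-difference signal used as input to shot clustering.

    Assumption: grayscale histograms compared via the L1 distance between
    cumulative histograms; the paper does not specify the exact metric.
    """
    cap = cv2.VideoCapture(video_path)
    diffs, prev_cdf = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
        cdf = np.cumsum(hist / hist.sum())              # cumulative histogram
        if prev_cdf is not None:
            diffs.append(np.abs(cdf - prev_cdf).sum())  # change between frames
        prev_cdf = cdf
    cap.release()
    return np.asarray(diffs)  # small values: within shot; spikes: boundaries
```

Small values of this signal would fall into cluster0 (within shot), while slow or abrupt spikes correspond to the boundary clusters.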

2.2. VMZ Decision Modules

There are four main submodules in the VMZ decision engine, as shown in Figure 2.

• Resource Management: this handles SWaPT constraints such as energy, time, memory footprint, precision requirements, priority, architecture selection, and trade-off management.

(a) Fr. 000000 (b) st+ncc=120 (c) surf=569 (d) asift=1363
(e) Fr. 001369 (f) st+ncc=112 (g) surf=338 (h) asift=632

Figure 4. Feature distributions of the descriptors in VMZ for two example frames from VIRAT [] (top row: Frame 000000; bottom row: Frame 001369). The first column shows the original frame; the second, third, and fourth columns show the keypoints detected by the ST+NCC, SURF, and ASIFT descriptors, respectively. The number of features extracted by ST+NCC, SURF, and ASIFT is 120, 569, and 1363 for Frame 000000, and 112, 338, and 632 for Frame 001369.

• Feature Management: the focus of this module is feature detection, matching, parameter tuning, etc. This is the most important module for this study; to ensure processing efficiency and matching accuracy, VMZ uses multiple types of features with different computational complexities and strengths. Figure 4 showcases the behavior of different feature descriptors (ST+NCC [28], SURF [30], ASIFT [31]) in VMZ for two images selected from dataset 09152008flight2tape1_6 (top row: Frame 000000, bottom row: Frame 001369). For the first image, the ST+NCC features are well distributed across the image, with 120 keypoints in total, while SURF and ASIFT find 569 and 1363 features respectively that are comparatively clustered. Interestingly, for the second image, which is mostly occluded, ST+NCC still finds features distributed across the whole image, even in the occluded part. This is the main drawback of the ST+NCC descriptor: it later misguides the registration process with false matches in the occluded region. On the other hand, SURF and ASIFT perform much better, though their number of features drops compared to the first frame.

• Reference or KeyFrame Hypervisor: this module monitors motion activity in the scene and the matching performance of the method, and also handles reference frame management (i.e., when to start a new reference frame) and canvas management.

• Homography and Registration: VMZ uses the Direct Linear Transformation (DLT) with RANSAC to estimate the homography between two frames, warps each frame to the new coordinate system, and blends the new frame into the mosaic. Given that the scene is approximately planar, two images I(x, y, t) and I(u, v, t − k) are related by the homogeneous relationship of Eqs. 2 and 3, where the first two parameters of I, x and y, denote spatial coordinates and the third, t, corresponds to the time of image capture.


Figure 5. Adaptive Descriptor in VMZ: detail of rows D and E from Figure 3, where each color set represents one group from a shot. The top row (row D) shows the use of the ST+NCC descriptor to map every frame to the reference of its group. The bottom row (row E) shows the use of the ASIFT descriptor to match consecutive references.

$$u = \frac{a_1 x + a_2 y + a_0}{c_1 x + c_2 y + 1} \qquad (2)$$

$$v = \frac{b_1 x + b_2 y + b_0}{c_1 x + c_2 y + 1} \qquad (3)$$

For blending, a simple technique called pixel replacement is used: in the overlapping region, pixels from the previously warped image are replaced by pixels from the next warped image, as sketched below.
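As a concrete illustration of this last module, here is a minimal Python sketch that estimates a homography with RANSAC (OpenCV's findHomography runs a normalized DLT on the RANSAC-selected inliers), warps the new frame, and blends it by pixel replacement. The function name register_and_blend, the use of OpenCV, and the numeric cutoffs are our assumptions for illustration, not the paper's implementation.

```python
import cv2
import numpy as np

def register_and_blend(mosaic, frame, pts_frame, pts_mosaic, ransac_thresh=3.0):
    """Estimate a homography (DLT inside RANSAC) and blend by pixel replacement.

    pts_frame, pts_mosaic: Nx2 float32 arrays of matched keypoint coordinates.
    Assumes 3-channel images; the 0.5 inlier-ratio cutoff is illustrative.
    """
    H, inlier_mask = cv2.findHomography(
        pts_frame, pts_mosaic, cv2.RANSAC, ransac_thresh)
    if H is None or inlier_mask.mean() < 0.5:   # too few inliers: skip frame
        return mosaic, False
    h, w = mosaic.shape[:2]
    warped = cv2.warpPerspective(frame, H, (w, h))
    mask = warped.sum(axis=2) > 0               # pixels covered by the new frame
    mosaic[mask] = warped[mask]                 # pixel-replacement blending
    return mosaic, True
```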

2.3. Fused Descriptor: ST+NCC, ASIFT

VIRAT is a challenging dataset, with a few shots containing occluded frames. In such cases, a fast and efficient descriptor might not find the best matches between two frames. Though ASIFT is comparatively robust, it is extremely time consuming. Thus, to maintain a balance between performance and efficiency, we use a robust fused descriptor that offers the efficiency of ST+NCC and SURF while preserving the robustness of ASIFT.

The idea of the fused descriptor is very simple, as demonstrated in Figure 5. We start with the first/base frame of a shot. The base frame is also taken as the reference of the first group. Each subsequent frame is added to the group, and the ST+NCC descriptor is used to match it with the reference (row D, Figure 5). If the number of frames in the group exceeds 40, or the motion between the current frame and the reference is larger than a threshold, a new group is formed.

Next, consecutive reference frames are mapped to each other in order to align them with the base frame. For this mapping, ASIFT is applied, as shown in row E of Figure 5. This keeps VMZ efficient, as ASIFT is not used unless necessary. A sketch of this adaptive selection logic follows.
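The Python sketch below outlines the adaptive grouping and descriptor-selection logic just described. The matcher and motion-estimation callables are hypothetical placeholders standing in for the ST+NCC and ASIFT matchers; the 40-frame limit comes from the text, while the motion threshold value is unstated in the paper and left as a parameter.

```python
def group_and_match(shot_frames, cheap_matcher, robust_matcher, motion_of,
                    max_group=40, motion_thresh=50.0):
    """Adaptive descriptor scheduling: the cheap matcher (ST+NCC) is used
    within a group, the robust matcher (ASIFT) only between consecutive
    reference frames.

    cheap_matcher / robust_matcher: callables (frame_a, frame_b) -> matches.
    motion_of: callable matches -> scalar inter-frame motion estimate.
    max_group=40 comes from the paper; motion_thresh=50.0 is illustrative.
    """
    reference, group_size = shot_frames[0], 1
    results = []
    for frame in shot_frames[1:]:
        matches = cheap_matcher(frame, reference)
        if group_size >= max_group or motion_of(matches) > motion_thresh:
            # Close the group: link the new reference to the old one with the
            # robust (expensive) matcher so all groups chain back to the base.
            ref_matches = robust_matcher(frame, reference)
            results.append((frame, reference, ref_matches, "asift"))
            reference, group_size = frame, 1
        else:
            results.append((frame, reference, matches, "st+ncc"))
            group_size += 1
    return results
```

Passing the matchers in as callables keeps the sketch self-contained while making the key design choice explicit: the expensive descriptor is invoked once per group boundary rather than once per frame.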

3. Experimental Results

We present the mini-mosaics produced by VMZ with different feature descriptors for VIRAT datasets 09152008flight2tape1_6 and 09152008flight2tape2_8. To understand the strengths and weaknesses of each descriptor, we design a few test cases. Test case A represents a scenario where every descriptor performs equally well on a shot with a very small number of frames. For test case B, we

(a) st+ncc (b) surf (c) asift (d) st+ncc, asift

Figure 6. Frame no. vs. no. of features obtained from different descriptors for shot1 (VIRAT: 09152008flight2tape1_6). 1st, 2nd, 3rd, and 4th rows: ST+NCC, SURF, ASIFT, and adaptive ST+NCC, ASIFT. The bottom row presents the mini-mosaics from the 4 descriptors.

pick a shot with a large number of frames, check how ST+NCC works, and compare the result with our fused descriptor. Finally, for test case C we pick a shot with an occluded region and compare the performance of SURF with the adaptive descriptor.

3.1. Test Case A

First, we pick the easy test case of shot1 (#frames=82) from dataset 09152008flight2tape1_6. All the frames of this shot are rich in distinctive features and there is no occlusion. We examine the keypoints obtained from the different descriptors in Figure 6, where the x-axis represents the frame index within the shot and the y-axis the number of features matched between the current frame and its corresponding reference frame.

The top row shows the result for ST+NCC features with the NCC block step set to 40 (the block step is a parameter of ST+NCC and is inversely proportional to the number of features). The number of features in each frame is almost constant (around 120). This follows from the fact that the ST+NCC descriptor is designed to find features even in textureless and homogeneous regions, so it will always return features for a frame, even an effectively featureless one. The mini-mosaic from this descriptor is shown in the 1st column of the bottom row.
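To illustrate how a grid-based block step caps the feature count, here is a minimal Python sketch of structure-tensor feature selection: it computes a minimum-eigenvalue corner response and keeps the strongest response in each block_step x block_step cell, so a larger step yields fewer features and every cell, even a homogeneous one, contributes a keypoint. This is a generic reconstruction under those assumptions, not the authors' ST+NCC code.

```python
import cv2
import numpy as np

def structure_tensor_features(gray, block_step=40, window=5):
    """Pick one structure-tensor keypoint per grid cell of size block_step.

    Response is the smaller eigenvalue of the smoothed structure tensor
    (Shi-Tomasi style), which OpenCV computes directly.
    """
    response = cv2.cornerMinEigenVal(gray, blockSize=window)
    h, w = gray.shape
    keypoints = []
    for y in range(0, h, block_step):
        for x in range(0, w, block_step):
            cell = response[y:y + block_step, x:x + block_step]
            dy, dx = np.unravel_index(np.argmax(cell), cell.shape)
            keypoints.append((x + dx, y + dy))  # strongest corner in the cell
    return keypoints  # count scales as (h / block_step) * (w / block_step)
```

This grid selection also explains the drawback noted in Section 2.2: a keypoint is returned for every cell, including cells lying on occluded or textureless regions.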

The 2nd row shows the features from the SURF descriptor with the MetricThreshold parameter set to 500. The default value of this parameter is 1000; it is lowered to 500 to increase the number of extracted features. With SURF, feature matching follows a clear pattern: within a group, it is highest for the frame right after the reference frame, since those two frames are consecutive, and the number of feature matches then decreases as frames get farther from the reference, until a new group is started. The mini-mosaic from this descriptor is shown in the 2nd column of the bottom row.
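MetricThreshold is the blob-strength cutoff exposed by MATLAB's detectSURFFeatures, which the parameter name suggests was the authors' tool; an analogous effect in OpenCV is obtained through the Hessian threshold, as in this hedged Python sketch (the numeric values are only analogous, not interchangeable with MetricThreshold).

```python
import cv2

def surf_keypoints(gray, hessian_threshold=500):
    """Lower the detector threshold to admit more (weaker) SURF keypoints.

    Requires opencv-contrib-python: SURF lives in the xfeatures2d module.
    """
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    keypoints, descriptors = surf.detectAndCompute(gray, None)
    return keypoints, descriptors

# Halving the threshold, as the paper does with MetricThreshold (1000 -> 500),
# admits weaker blobs and so increases the number of detected features on
# low-contrast aerial frames.
```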


(a) st+ncc(40) (b) st+ncc(80) (c) st+ncc(40), asift

Figure 7. Frame no. vs. no. of features obtained from different descriptors for shot11 (VIRAT: 09152008flight2tape1_6). 1st, 2nd, and 3rd: ST+NCC (block step = 40), ST+NCC (block step = 80), and adaptive ST+NCC (block step = 40), ASIFT. The bottom row presents the mini-mosaics from the corresponding descriptors.

Next, we present the matched features from the ASIFT descriptor. ASIFT is more robust than ST+NCC and SURF, so feature matching is quite consistent regardless of the positions of the reference and current frames. The mini-mosaic from this descriptor is shown in the 3rd column of the bottom row.

Finally, the 4th row of Figure 6 shows the adaptive descriptor's performance, where ST+NCC is used to match any frame in a group to its reference (red bars) and ASIFT is used for reference-to-reference matching (blue bars). The mini-mosaic from this descriptor is shown in the 4th column of the bottom row.

This is an easy test case, where the number of frames is not too large and there is no obvious problem with the frames. Consequently, the mini-mosaics generated from shot1 look almost identical in all 4 cases (bottom row).

3.2. Test Case B

The next test case is designed around shot11 (#frames=1920) from dataset 09152008flight2tape1_6. ST+NCC features are extracted with two different NCC block steps. First, the NCC block step is set to 40 and the number of matched features is displayed in the top row of Figure 7. As in shot1, the number of matched features is almost constant (around 120) for all images. However, we can observe that many frames are skipped during the final homography estimation. As the number of frames increases, error accumulates along the homography chain, so the homography matrices of the later frames of the shot drift far from the ground truth. Consequently, those frames are excluded from mosaicking (shown by the magenta circle) and the obtained mini-mosaic is only a smaller part of the expected mini-mosaic. In this experiment,

(a) surf(500) (b) surf(1000) (c) st+ncc(40), asift

Figure 8. Frame no. vs. no. of features obtained from different descriptors for shot4 (VIRAT: 09152008flight2tape1_6). 1st, 2nd, and 3rd: SURF (MetricThreshold = 500), SURF (MetricThreshold = 1000), and adaptive ST+NCC (block step = 40), ASIFT. The bottom row presents the mini-mosaics from the corresponding descriptors.

995 frames out of 1920 are skipped. The mini-mosaic from this descriptor is shown in the 1st column of the bottom row.
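The drift described above comes from composing estimated homographies: each frame reaches the base-frame coordinate system through a chain of per-group transforms, so small per-link errors compound. In the notation below (ours, for illustration), $H_{i \to j}$ maps coordinates of frame $i$ into frame $j$, $r_1, \ldots, r_m$ are the successive group references of a shot ($r_0$ being the base frame), and $k$ is a frame in the group of reference $r_m$:

$$H_{k \to r_0} = \Big(\prod_{i=1}^{m} H_{r_i \to r_{i-1}}\Big)\, H_{k \to r_m}$$

A small error in any factor propagates to every later frame, which is why long shots such as shot11 lose their tail frames.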

Next, we increased the NCC block step to 80 and found that the number of matched features per frame drops to an average of 25, even though the number of extracted features is designed to be more than 100. As a result, the inlier percentage during RANSAC becomes very low for some images, and those are skipped (shown by the black circle). In addition, the number of frames skipped due to bad homography increases too. In total, 1243 frames out of 1920 are skipped, making this configuration even worse. The mini-mosaic from this descriptor is shown in the 2nd column of the bottom row.

Finally, we compared the performance of the fused descriptor (ST+NCC, ASIFT) with ST+NCC alone. Here, the NCC block step is set to 40 and ASIFT is used for reference-to-reference matching. Consequently, we get a good inlier percentage during RANSAC and reasonably good homography estimates, resulting in zero frames being skipped. The mini-mosaic from this descriptor is shown in the 3rd column of the bottom row.

Thus, it can be concluded that for shots with a large number of frames, the adaptive descriptor performs better than the ST+NCC descriptor.

3.3. Test Case C

For this experiment, we pick shot4 from dataset 09152008flight2tape1_6; the result is shown in Figure 8. We choose two different MetricThreshold values for SURF. The default value is 1000, and the corresponding graph is shown in the middle row. As all frames of this shot are partially occluded, the number of matched features is very poor. Thus, 11 frames out of 55 are skipped due to a low inlier percentage during


TABLE 1. Number of skipped frames for different descriptors on datasets 09152008flight2tape2_8 and 09152008flight2tape1_6 (the latter marked with *).

shot    | #frames | ST+NCC | SURF | ASIFT | ST+NCC, ASIFT
shot1*  |      82 |      0 |    0 |     0 |             0
shot1   |    1312 |      0 |    0 |     0 |             0
shot2   |     258 |      0 |    0 |     0 |             0
shot3   |     649 |      0 |    0 |     0 |             0
shot4*  |      55 |      0 |    8 |     0 |             0
shot4   |     916 |      0 |    0 |     0 |             0
shot5   |     742 |      0 |    0 |     0 |             0
shot6   |     740 |    111 |    0 |     0 |             0
shot7   |     109 |      0 |    0 |     0 |             0
shot8*  |     301 |      0 |    6 |     0 |             0
shot8   |      48 |      0 |    0 |     0 |             0
shot9   |     699 |      0 |    0 |     0 |             0
shot10  |    1920 |      0 |    0 |     0 |             0
shot11  |    1920 |    995 |    0 |     0 |             0

RANSAC estimation (shown by the black circles). The mini-mosaic from this descriptor is shown in the 2nd column of the bottom row.

To increase the number of features, the MetricThreshold value is decreased to 500. Although this does not help as much as expected, the number of frames skipped due to a low inlier percentage during RANSAC estimation drops to 8 (shown by the black circle). The mini-mosaic from this descriptor is shown in the 1st column of the bottom row.

Finally, we present the number of matched features from the adaptive descriptor, which outperforms SURF (zero frames are skipped by ST+NCC, ASIFT). The mini-mosaic from this descriptor is shown in the 3rd column of the bottom row.

Thus, it can be concluded that for shots with occludedframes, the adaptive descriptor performs better than SURF.

3.4. Accuracy vs Performance

The mini-mosaics obtained from the different descriptors are shown in Figure 10, which demonstrates that the adaptive descriptor ST+NCC, ASIFT produces better mini-mosaics than the other two (ST+NCC and SURF). The number of frames skipped by each descriptor is presented in Table 1, which also supports the superiority of the adaptive descriptor over ST+NCC and SURF. Though ASIFT performs as well as our fused descriptor according to the table, we prefer the adaptive descriptor due to the slow nature of ASIFT. From Figure 9, we can observe that SURF is the fastest at 0.62 secs/frame, while ST+NCC, ASIFT, and the adaptive ST+NCC, ASIFT take 1.1, 25.34, and 2.7 secs/frame, respectively. Thus it can be concluded that our adaptive descriptor is robust in all cases for the VIRAT dataset and its speed (9.39x faster than ASIFT) is comparable with SURF or ST+NCC.
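The 9.39x figure follows directly from the per-frame timings above:

$$\text{speedup} = \frac{25.34\ \text{s/frame (ASIFT)}}{2.7\ \text{s/frame (adaptive)}} \approx 9.39$$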

4. Conclusions

This study focused on picking a suitable feature descriptor for aerial imagery with occlusion and other complexities. We found SURF to be the fastest and reasonably robust, although it fails to properly match images with occlusion. ST+NCC is comparable to SURF in speed, but

Figure 9. Execution time per frame for different descriptors in log scale.

(a) st+ncc(40) (b) surf(500) (c) st+ncc(40), asift

Figure 10. Examples of additional mini-mosaics for different feature descriptors from 09152008flight2tape2_8. Top to bottom: shot3, shot4, shot6, and shot7.

it suffers from feature misdetection, especially where images have occlusion. In addition, it also performs poorly when the number of frames grows large (around 2000). ASIFT is found to be the most robust and handles occlusion as well as the other complexities of the VIRAT dataset, though it is extremely time consuming. Our adaptive descriptor shows performance similar to ASIFT, while its speed is comparable with SURF or ST+NCC.

In the future, we plan to add deep (learning) features to our VMZ algorithm to improve both performance and time complexity. An adaptive block step for ST+NCC might be useful for datasets with different image sizes.

References

[1] L. G. Brown, "A survey of image registration techniques," ACM Computing Surveys (CSUR), vol. 24, no. 4, pp. 325–376, 1992.

[2] B. Zitova and J. Flusser, "Image registration methods: a survey," Image and Vision Computing, vol. 21, no. 11, pp. 977–1000, 2003.

[3] R. Kumar, H. Sawhney, S. Samarasekera, S. Hsu, H. Tao, Y. Guo, K. Hanna, A. Pope, R. Wildes, D. Hirvonen et al., "Aerial video surveillance and exploitation," Proceedings of the IEEE, vol. 89, no. 10, pp. 1518–1539, 2001.

[4] H. Goncalves, L. Corte-Real, and J. A. Goncalves, "Automatic image registration through image segmentation and SIFT," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 7, pp. 2589–2600, 2011.

[5] T. Yang, J. Li, J. Yu, S. Wang, and Y. Zhang, "Diverse scene stitching from a large-scale aerial video dataset," Remote Sensing, vol. 7, no. 6, pp. 6932–6949, 2015.

[6] R. Viguier, C. C. Lin, H. AliAkbarpour, F. Bunyak, S. Pankanti, G. Seetharaman, and K. Palaniappan, "Automatic video content summarization using geospatial mosaics of aerial imagery," in Multimedia (ISM), 2015 IEEE International Symposium on. IEEE, 2015, pp. 249–253.

[7] Z. Zhu, E. M. Riseman, A. R. Hanson, and H. Schultz, "An efficient method for geo-referenced video mosaicing for environmental monitoring," Machine Vision and Applications, vol. 16, no. 4, pp. 203–216, 2005.

[8] Y. Lin and G. Medioni, "Map-enhanced UAV image sequence registration and synchronization of multiple image sequences," in Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007, pp. 1–7.

[9] E. Molina and Z. Zhu, "Persistent aerial video registration and fast multi-view mosaicing," IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 2184–2192, 2014.

[10] H. Trinh, J. Li, S. Miyazawa, J. Moreno, and S. Pankanti, "Efficient UAV video event summarization," in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 2226–2229.

[11] R. Aktar, V. S. Prasath, H. Aliakbarpour, U. Sampathkumar, G. Seetharaman, and K. Palaniappan, "Video haze removal and Poisson blending based mini-mosaics for wide area motion imagery," in Applied Imagery Pattern Recognition Workshop (AIPR), 2016 IEEE. IEEE, 2016, pp. 1–7.

[12] R. Aktar, H. AliAkbarpour, F. Bunyak, T. Kazic, G. Seetharaman, and K. Palaniappan, "Geospatial content summarization of UAV aerial imagery using mosaicking," in Geospatial Informatics, Motion Imagery, and Network Analytics VIII, vol. 10645. International Society for Optics and Photonics, 2018, p. 106450I.

[13] M. Brown and D. G. Lowe, "Automatic panoramic image stitching using invariant features," International Journal of Computer Vision, vol. 74, no. 1, pp. 59–73, 2007.

[14] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[15] Z. Hua, Y. Li, and J. Li, "Image stitch algorithm based on SIFT and MVSC," in Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on, vol. 6. IEEE, 2010, pp. 2628–2632.

[16] Y. Li, Y. Wang, W. Huang, and Z. Zhang, "Automatic image stitching using SIFT," in Audio, Language and Image Processing, 2008. ICALIP 2008. International Conference on. IEEE, 2008, pp. 568–571.

[17] J. Xing and Z. Miao, "An improved algorithm on image stitching based on SIFT features," in Innovative Computing, Information and Control, 2007. ICICIC'07. Second International Conference on. IEEE, 2007, pp. 453–453.

[18] Y.-x. Li, Y. Zeng, R.-y. Zhong, and T. Guo, "Algorithm of image stitching based on SIFT feature matching," Computer Technology and Development, vol. 1, pp. 43–52, 2009.

[19] L. Juan and G. Oubong, "SURF applied in panorama image stitching," in Image Processing Theory Tools and Applications (IPTA), 2010 2nd International Conference on. IEEE, 2010, pp. 495–499.

[20] Q. Liu and M.-y. He, "Image stitching based on SURF feature matching," Measurement & Control Technology, vol. 10, p. 008, 2010.

[21] L. Zhu, Y. Wang, B. Zhao, and X. Zhang, "A fast image stitching algorithm based on improved SURF," in 2014 Tenth International Conference on Computational Intelligence and Security (CIS). IEEE, 2014, pp. 171–175.

[22] M. M. Pandya, N. G. Chitaliya, and S. R. Panchal, "Accurate image registration using SURF algorithm by increasing the matching points of images," International Journal of Computer Science and Communication Engineering, vol. 2, no. 1, pp. 14–19, 2013.

[23] D. J. Holtkamp and A. A. Goshtasby, "Precision registration and mosaicking of multicamera images," IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 10, pp. 3446–3455, 2009.

[24] T. Xu, D. Ming, L. Xiao, and C. Li, "Stitching algorithm of sequence image based on modified KLT tracker," in Computational Intelligence and Design (ISCID), 2012 Fifth International Symposium on, vol. 2. IEEE, 2012, pp. 46–49.

[25] S. Hu, X. Ge, and Z. Chen, "Based on corner feature KLT track panoramic mosaic algorithm," Journal of System Simulation, vol. 19, no. 8, pp. 1742–1744, 2007.

[26] X.-h. Yang and M. Wang, "Seamless image stitching method based on ASIFT," Comput. Eng., vol. 39, no. 2, pp. 241–244, 2013.

[27] V. Codreanu, F. Dong, B. Liu, J. B. Roerdink, D. Williams, P. Yang, and B. Yasar, "GPU-ASIFT: A fast fully affine-invariant feature extraction algorithm," in High Performance Computing and Simulation (HPCS), 2013 International Conference on. IEEE, 2013, pp. 474–481.

[28] A. Hafiane, K. Palaniappan, and G. Seetharaman, "UAV-video registration using block-based features," in Geoscience and Remote Sensing Symposium, 2008. IGARSS 2008. IEEE International, vol. 2. IEEE, 2008, pp. II-1104.

[29] K. Palaniappan, R. M. Rao, and G. Seetharaman, "Wide-area persistent airborne video: Architecture and challenges," in Distributed Video Sensor Networks. Springer, 2011, pp. 349–371.

[30] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in European Conference on Computer Vision. Springer, 2006, pp. 404–417.

[31] J.-M. Morel and G. Yu, "ASIFT: A new framework for fully affine invariant image comparison," SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 438–469, 2009.