A COMPUTER VISION SYSTEM FOR SPACEBORNE SAFETY MONITORING

Faisal Qureshi 1, Diego Macrini 1, Desmond Chung 2, James Maclean 2, Sven Dickinson 1, and Piotr Jasiobedzki 3

1. Dept. of Computer Science, Univ. of Toronto, Toronto, Canada.

2. Dept. of Electrical and Computer Eng., Univ. of Toronto, Toronto, Canada.

3. MDRobotics Ltd., Brampton, Canada.

ABSTRACT

We propose a computer vision system, called a “visual supervisor”, capable of visually monitoring an environment with the goal of safety monitoring and task verification. We demonstrate our system in a spaceborne setting in which, from monocular video sequences, it 1) tracks and recognizes objects, 2) detects workspace violations, and 3) supervises in-progress tasks for anomalous situations, failures, or satisfactory progress. The system combines motion segmentation, object tracking, object identification, and pose estimation with cognitive reasoning to construct qualitative interpretations of the scene, such as, “An object, believed to be a satellite, appears to be on a collision course with a second object, believed to be the Space Shuttle.” We demonstrate our framework on the footage of mission STS-87 as viewed from the Space Shuttle Columbia, where the proposed framework correctly identified that the mission had failed.

Key words: Qualitative/cognitive vision; task monitoring.

1. INTRODUCTION

There are a number of space-based robotics tasks for which computer vision can play a critical role. One such task, critical to the Canadian Space Agency’s role in the International Space Station, involves the automatic grasping of a known object by a robotic arm. A calibrated vision system can provide the instantaneous position and orientation of the object with respect to the camera, allowing the robot to visually track and thereby grasp the object. Although there exist techniques in the computer vision community for computing (and tracking) the pose of a geometric object from a 2-D image (sequence), these techniques can fail under extreme lighting conditions, such as those found in a space-based environment. Unfortunately, such failures cannot always be readily detected by the tracking module. Furthermore, even when a failure can be self-detected, automatic recovery (pose correction) may not be possible without manual intervention. These task-specific modules overconstrain the world, which is both a blessing and a curse. For when the environment changes unexpectedly, their myopic view of the world can render them helpless.

To cope with this limitation, we introduce the concept of a “visual supervisor” module that can visually monitor the environment with the goal of safety monitoring and task verification. The visual supervisor can track moving objects, label the objects according to prototypical classes, detect and react to events in a larger context, and may provide initialization and operational constraints to the task-specific modules. The resulting framework therefore consists of a hierarchy of visual behaviours, each sensing the world at a different level of resolution. Task-specific behaviours, such as pose estimation, tracking, and grasping, subsume lower-level behaviours, such as camera control and obstacle avoidance. The task-specific behaviours, in turn, are subsumed by the visual supervisor, which can both monitor and govern their performance as well as their interaction with other task-specific behaviours.

One of the highlights of the proposed scheme is that it brings together qualitative and quantitative visual processes in a behavior-based control environment. This coupling is motivated by the fact that although task-specific vision, such as CAD-based vision and tracking (Ref. 1), can yield accurate metric pose information, it cannot accommodate abnormal conditions, such as extreme illumination changes or unexpected objects. A more qualitative visual “front end” that can identify and track the qualitative shapes of multiple objects without knowing their exact geometry can provide a level of robustness essential for space robotics.

Operations in space have evolved for the most part with a human presence. As the capabilities of robotic systems improve, they offer a viable way to reduce the costs and risks associated with human presence in space. Ground controlled, semi- and fully autonomous space systems will require complex vision systems that can operate under space illumination, as well as a large range of distances and viewpoints. Current space vision systems are designed for single functions only and rely on operator assistance. Future vision systems will incorporate different strategies (modules), use different modalities, monitor their operation, and function autonomously (Ref. 2). Autonomy is the key requirement for future space vision/robotics applications, such as those listed below:

• Robotic applications for reusable space vehicles

• On-orbit satellite servicing

• Planetary exploration

• Servicing of scientific payloads

• Support to space infrastructure development

• Utilization and maintenance of existing and future space infrastructure

Furthermore, as reliance on autonomous controllers increases, visual supervisors, such as the one described in this paper, become increasingly relevant. The visual supervisor will not only monitor the task-specific visual processes, but also monitor the interaction of these processes. For a robot mounted on the space station to grasp an object, for example, one might envision a visual supervisor whose field of view not only contains the robot’s workspace but also the surrounding region. From this vantage, the visual supervisor can not only examine the progress of the task involving the robot, but also detect if an unauthorized object is about to encroach on its workspace. The visual supervisor thus provides a visual “sanity check” on what’s happening in the environment.

We demonstrate our visual supervisor framework on the footage of mission STS-87, an actual satellite acquisition task, as viewed from the Space Shuttle Columbia. The system correctly observes and reports the anomalous condition that the acquisition has failed.

The paper is organized as follows. We begin by presenting the system overview in Section 2. Then, in Section 3, we explain the motion segmentation and tracking module. Section 4 describes the object representation and matching scheme. In Section 5, we introduce our spatio-temporal reasoning scheme for object identification. Section 6 introduces the task hypothesis generation and verification scheme. We conclude the paper with results and conclusions in Sections 7 and 8, respectively.

2. SYSTEM OVERVIEW

The visual supervisor follows a hierarchical visual processing paradigm where low-level visual routines, such as motion segmentation and tracking, support higher-level processing that involves spatio-temporal reasoning (Figure 1). A long-standing challenge in computer vision is to successfully combine low-level visual routines with knowledge-based reasoning to construct high-level qualitative interpretations of a scene that go beyond the usual interpretations, such as background-foreground separation or motion tracking. We address this challenge by employing CoCo, an agent control framework that combines reactivity and deliberation by managing the cogitation-action cycle (Ref. 3).

Figure 1. The Visual Supervisor: An Overview. The diagram shows motion tracking and object segmentation feeding shock graph construction, graph indexing and matching against an object view database, object reasoning, and task reasoning against an active task database, with motion trends informing the reasoning modules.

Due to extreme lighting variation in a space environment, appearance-based recognition modules are ineffective for object identification. The only stable feature that we can exploit for object identification and pose estimation is the shape of an object’s silhouette. Our framework therefore begins by recovering the silhouette boundary of a moving object, and then recognizes the object based on the qualitative shape of the silhouette. We will describe these two modules in greater detail in the following sections.

The first step involves tracking regions of coherent motion, corresponding to multiple, independently moving objects, against a possibly moving background, in the field of view. We then construct a part-based representation consisting of a collection of qualitatively defined parts, called a shock graph, from the bounding contour of a given coherently moving region. Next, an integrated graph indexing/matching module matches the recovered shock graph against a database of shock graph “views” of known objects, resulting in a ranked list of object-view hypotheses.
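To make the flow of data between these modules concrete, the following Python sketch outlines one possible per-frame organization of the pipeline. It is only a structural illustration under our own naming assumptions; the component callables (segment_motion, build_shock_graph, match_views) are hypothetical placeholders for the modules described in Sections 3 to 5, not the actual implementation.

from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class VisualSupervisor:
    # Component callables are injected; the names are illustrative only.
    segment_motion: Callable[[Any, Any], List[Any]]            # (prev_frame, frame) -> motion regions
    build_shock_graph: Callable[[Any], Any]                    # region contour -> shock graph (Section 4)
    match_views: Callable[[Any], List[Tuple[str, int, float]]] # graph -> ranked (object, view, similarity)
    history: List[list] = field(default_factory=list)          # evidence consumed by Sections 5 and 6

    def process_frame(self, prev_frame, frame):
        interpretations = []
        for region in self.segment_motion(prev_frame, frame):   # Section 3
            graph = self.build_shock_graph(region)
            hypotheses = self.match_views(graph)                 # indexing + matching
            interpretations.append({"region": region, "hypotheses": hypotheses})
        self.history.append(interpretations)
        return interpretations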

The reasoning module uses the available motion history and object pose hypotheses and employs weak, domain-independent notions of spatial and temporal coherence to quickly identify objects and characterize their motions. Given a priori knowledge of the currently active tasks, the system attempts to compare the observed object behavior with the expected object behavior. Unrecognizable objects or unexpected object motions can be used to issue warnings and invoke task-specific routines to further investigate the anomalies.

Figure 2. Region-based information is used to derive flow constraints, which provide data for the active contour. The contour, in turn, is warped between frames according to the motion parameters. The contour reinforces spatial coherence by considering only those motion constraints that lie within the contour.

3. MOTION SEGMENTATION AND OBJECT TRACKING

The tracking module (Ref. 4) estimates spatially coherent motion by combining region (layered motion estimation) and boundary information (active contour estimation). The tracking scheme employed here assumes no a priori knowledge about the number of regions to be estimated or the shape of the regions. Furthermore, it does not assume a static background or a stationary camera. These attributes make the motion estimation and object tracking scheme used here particularly suitable for a space environment, where color-based object tracking schemes might fail due to harsh lighting situations.

The tracking module begins by estimating a sparse set of motion constraints in successive frames by assuming brightness constancy. These motion constraints are then classified according to a constant or affine parametric model using the Expectation Maximization (EM) algorithm. Dense segmentation constraints, which are required to initialize an active contour, are generated by warping the images according to the recovered motion and evaluating the match of pixel intensities. Once initialized, the shape of the active contour is governed by both motion and image gradient information. The spatially coherent active contour influences motion estimation in subsequent frames by excluding motion constraints that do not lie within the contour, yielding a synergy between region- and boundary-based motion estimation.
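As an illustration of the layer-assignment step, the sketch below computes per-constraint layer ownerships for constant-velocity layers, i.e., the E-step of a mixture model over brightness constancy residuals Ix*ux + Iy*uy + It. It is a simplified reading of the approach in Ref. 4 under assumed constant-velocity layers and a shared noise level; the function name and parameters are our own.

import numpy as np

def ownership_weights(constraints, velocities, mixing, sigma=1.0):
    """E-step ownerships for constant-velocity motion layers.

    constraints: (m, 3) array of brightness constancy constraints [Ix, Iy, It];
    velocities:  (n, 2) array, one candidate 2-D velocity per layer;
    mixing:      (n,) mixing proportions that sum to 1.
    """
    m, n = constraints.shape[0], velocities.shape[0]
    weighted = np.empty((m, n))
    for j in range(n):
        u_h = np.array([velocities[j, 0], velocities[j, 1], 1.0])
        residual = constraints @ u_h          # Ix*ux + Iy*uy + It, zero for a perfect fit
        weighted[:, j] = mixing[j] * np.exp(-0.5 * (residual / sigma) ** 2)
    return weighted / weighted.sum(axis=1, keepdims=True)   # per-constraint layer ownership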

4. SHAPE REPRESENTATION AND MATCHING

The object views emerging from the tracking process must be matched against a model database to determine their identity and pose. Toward this end, we represent the silhouette of each view by a shock graph. A shock graph is a shape abstraction that decomposes the skeleton of a silhouette into a set of hierarchically organized primitive parts (Ref. 5). This hierarchy is given as a directed acyclic graph (DAG), in which node attributes encode local shape properties, such as the radii and relative positions of the skeleton points. The problem of comparing query views with model views is then solved by matching the query graphs against a set of candidate graphs from the model database.

If there were no noise, our problem could be formulated as a graph isomorphism problem for vertex-labeled graphs. Unfortunately, in the presence of significant noise, in the form of the addition and/or deletion of graph structure, large isomorphic subgraphs may simply not exist. We overcome this problem by proposing a matching algorithm that accounts for local similarity measures of graph topology and node attributes. In turn, our characterization of graph topology becomes the discriminative feature in a fast screening mechanism for pruning the database down to a tractable number of candidates. Details of the pruning algorithm can be found in (Ref. 6).

We draw on the eigenspace of a graph to characterize the topology of a DAG with a low-dimensional vector. The eigenvalues of a graph’s adjacency matrix encode important structural properties of the graph, characterizing the degree distribution of its nodes. Moreover, we have shown that the magnitudes of the eigenvalues are stable with respect to minor perturbations of graph structure due to, for example, noise, segmentation error, or minor within-class structural variation (Ref. 6). For every rooted directed acyclic subgraph (i.e., a shape part) of the original DAG, we compute a function of the eigenvalues of the subgraph’s antisymmetric {0, 1, −1} node-adjacency matrix, which yields a low-dimensional topological signature vector (TSV) encoding of the structure of the subgraph. Details of the TSV, along with an analysis of its stability, can be found in (Ref. 6).
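The following sketch illustrates the flavor of such a signature: for the subgraph rooted at a node, build the antisymmetric {0, 1, -1} adjacency matrix and keep the largest eigenvalue magnitudes as a fixed-length vector. It is a simplified stand-in for the actual TSV construction detailed in Ref. 6; the function names and the dimension parameter are our own.

import numpy as np

def descendants(dag, root):
    # Nodes reachable from the root in a {node: [children]} DAG.
    seen, stack = set(), [root]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(dag.get(v, []))
    return sorted(seen)

def tsv(dag, root, dim=8):
    """dag: {node: [children]} for a rooted DAG; returns a length-`dim` signature."""
    nodes = descendants(dag, root)
    index = {v: i for i, v in enumerate(nodes)}
    a = np.zeros((len(nodes), len(nodes)))
    for u in nodes:
        for v in dag.get(u, []):
            a[index[u], index[v]], a[index[v], index[u]] = 1.0, -1.0   # antisymmetric {0, 1, -1}
    mags = np.sort(np.abs(np.linalg.eigvals(a)))[::-1]                 # eigenvalue magnitudes, descending
    sig = np.zeros(dim)
    sig[:min(dim, len(mags))] = mags[:dim]                             # truncate or zero-pad
    return sig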

Each node in a graph (query or model) is assigned a TSV, which reflects the underlying structure in the subgraph rooted at that node. This encoding of graph topology allows us to discard all the edges in our two graphs and then solve the matching problem by finding the best corresponding nodes; two nodes are said to be in close correspondence if the distance between their TSVs and the distance between their node attributes are small. Such a formulation amounts to finding the maximum cardinality, minimum weight matching in a bipartite graph spanning the two sets of nodes. We combine the above bipartite matching formulation with a greedy, best-first search in a recursive procedure to compute the corresponding nodes in two rooted DAGs which, in turn, yields an overall similarity measure that is used to rank the candidates. An overview of the algorithm is depicted in Figure 3, and the details can be found in (Ref. 7, 8).

Figure 3. The DAG matching algorithm. (a) Given a query graph and a model graph, (b) form a bipartite graph in which the edge weights are the pair-wise node similarities. Then, (c) compute a maximum matching and add the best edge to the solution set. Finally, (d) split the graphs at the matched nodes and (e) recursively descend.
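A minimal sketch of the bipartite step alone is shown below, pairing query and model nodes by a weighted combination of TSV distance and node-attribute distance and solving the assignment with the Hungarian method. The recursive split-and-descend procedure of Figure 3 (Refs. 7, 8) is omitted, and the weighting parameter alpha is an assumption of ours.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def bipartite_node_correspondences(query_tsvs, model_tsvs,
                                   query_attrs, model_attrs, alpha=0.5):
    """Each *_tsvs array is (n, d); each *_attrs array is (n, k). Returns node pairs and total cost."""
    cost = alpha * cdist(query_tsvs, model_tsvs) + \
           (1.0 - alpha) * cdist(query_attrs, model_attrs)
    rows, cols = linear_sum_assignment(cost)      # minimum-weight assignment
    return list(zip(rows, cols)), cost[rows, cols].sum()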

5. OBJECT REASONING

A fundamental requirement for the proper function of the visual supervisor is robust object identification. The visual supervisor must be able to positively identify objects in the field of view given the shape matching results, recover from short-term segmentation/tracking failures, and handle object occlusions. To meet these requirements, the visual supervisor maintains an abstracted mental model of the scene, which is continuously updated to reflect the passage of time and the latest results from the visual processing routines. Application of spatio-temporal rules upon the mental state yields the list of objects believed to be in the field of view.

The shock graph-based shape matching module treats each frame independently and returns a ranked list of object-view hypotheses for each image region that exhibits coherent motion as estimated by the motion segmentation and object tracking routines (Figure 4). Depending upon the quality of the bounding contour and the number of objects in the object-view database, the hypotheses might be ranked incorrectly, or worse, do not even include the target object. What further compounds the problem of choosing a candidate object for the region is the fact that object-pose hypotheses for a given region of motion can vary considerably between frames. The object reasoning module accounts for the noisy nature of motion segmentation and object tracking by maintaining a small, manageable number of candidate objects, instead of committing to a single object for a given region.

INDEXING RESULTS

Rank  Object Name    View #  Similarity
0     fuzzy_spartan  315     1
1     fuzzy_spartan  225     0.868278
2     fuzzy_spartan  270     0.868278
3     fuzzy_spartan  135     0.868278
4     fuzzy_spartan  45      0.759339
5     BAT            16      0.546703
6     fuzzy_spartan  315     0.546703
7     fuzzy_spartan  0       0.472085
8     fuzzy_spartan  0       0.414933
9     KNIFECLV       77      0.414933
10    BAT            22      0.400738
11    BAT            81      0.400738
12    BAT            101     0.400738

MATCHING RESULTS

Rank  Object Name    View #  Similarity
0     fuzzy_spartan  45      0.753641
1     fuzzy_spartan  135     0.546489
2     dog            46      0.538645
3     pig            42      0.525449
4     fuzzy_spartan  0       0.515806
5     dog            71      0.501153
6     ESPRESSO       3       0.490926
7     camel          127     0.487582
8     camel          70      0.473799
9     dog            47      0.467701
10    pig            87      0.464735
11    fuzzy_spartan  225     0.454066
12    HORSE          86      0.449975
13    fuzzy_spartan  0       0.435783
14    camel          22      0.428474
15    fuzzy_spartan  0       0.410162

Figure 4. Object pose hypotheses with similarity value of more than 0.4 for Region 1 in the video frame. The list of hypotheses also contains objects that clearly do not exist in space, which is due to the fact that models for these objects are present in the shape matching database. The models were added intentionally to challenge the shape matching and object reasoning modules.

The object reasoning module examines object-view hypotheses across multiple frames to select candidate objects for a given region. It begins by accumulating votes for each hypothesized object for a given region over multiple frames. Objects that occur more frequently receive higher votes, and those above a certain threshold are kept as possible candidates for the region. The success of our voting scheme depends on the assumption that inaccuracies in object hypotheses are temporally inconsistent and unstable, i.e., a given region will not be assigned the same incorrect object across multiple frames. We observed empirically that the above assumption holds true in our setting.

A naive voting scheme that involves counting the number of occurrences of a hypothesized object is not enough, as it only takes into account the frequency of occurrence without paying any attention to the pattern of occurrence (Figure 5(a)). We instead propose a weighted, occlusion-sensitive, and time-discounted voting scheme, which formalizes the notion that 1) recurrent and temporally stable hypotheses are more likely, 2) objects that have not been hypothesized for the last few frames should be forgotten, and 3) objects believed to be occluded should retain their values at least for some time in the future (Figure 5(b)).
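A schematic per-frame update capturing these three properties might look like the sketch below. The reward and decay constants, and the treatment of occluded objects, are illustrative assumptions rather than the values used in our prototype.

def update_votes(votes, hypotheses, occluded, reward=1.0, decay=0.8):
    """votes: {object: score}; hypotheses: {object: similarity} for the current
    frame; occluded: set of objects currently believed to be occluded."""
    for obj, similarity in hypotheses.items():
        votes[obj] = votes.get(obj, 0.0) + reward * similarity   # recurrent, stable hypotheses grow
    for obj in list(votes):
        if obj in hypotheses or obj in occluded:
            continue                        # occluded objects retain their score for a while
        votes[obj] *= decay                 # unseen objects are gradually forgotten
        if votes[obj] < 1e-3:
            del votes[obj]
    return votes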

The number of candidates for a given region can be further reduced by employing domain-specific knowledge, including reasoning about object pose, invoking object-specific recognition routines, or querying a human operator. Object pose reasoning involves enumerating all possibilities within the object pose space, computing a domain-dependent cost for each possibility, and keeping only possibilities that have low costs (Figure 5(c)). Reasoning about an object’s pose requires guarantees about the speed at which the shape matching module operates vis-a-vis the time scale of the environment. For the current prototype, we do not assume these guarantees, so we have not yet implemented domain-specific reasoning for object identification.

Figure 5. Object reasoning: a hypothetical scenario. (a) Object pose hypotheses for a given region at successive frames. (c) Objects’ votes accumulated over successive frames; after 8 frames, object ‘a’ appears the most likely candidate. (b) Object pose interpretation tree (object ‘a’) constructed after applying an object persistence constraint. Domain-specific knowledge can be used to assign a cost to every path in this tree; paths with higher costs are less likely. Reasoning within the object pose space further reduces the number of candidate objects for a given region.

The object reasoning module handles object ‘entry’, ‘exit’, and ‘occlusion’ events by examining the number of motion regions as estimated by the motion segmentation routine. An ‘entry’ event is detected when a new region is detected through motion segmentation. The object reasoning module avoids spurious ‘entry’ events by enforcing spatial coherence and a temporal stability constraint on the newly detected region over successive frames, i.e., the new region must be reliably detected across multiple frames before it is considered a valid ‘entry’ event. An ‘exit’ event is detected when an object leaves the field of view, which results in a decrease in the number of estimated motion regions. The number of estimated motion regions also decreases during object occlusion, so the object reasoning module employs first- and second-order spatial knowledge to distinguish between an ‘exit’ and an ‘occlusion’ event. An ‘exit’ event happens when one of the regions vanishes and none of the existing regions has the same spatial coordinates. This decision is finalized over multiple frames to avoid spurious events.

The object reasoning module can also detect when two objects form a physical connection, e.g., during a successful satellite capture operation, which we refer to as a ‘coupled’ event. We assume that the two objects are rigidly connected after the ‘coupled’ event, so we can use the number of estimated motion regions as a cue to detect a ‘coupled’ event. A ‘coupled’ event is detected when 1) the number of estimated regions of motion decreases and 2) the resulting bounding contour resembles the union of the two bounding contours, one for each object, prior to the ‘coupled’ event.
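One way to organize these region-count cues is sketched below. The multi-frame confirmation and the contour-union test are left abstract; the predicate union_resembles and the returned event labels are our own naming, not the prototype's.

def classify_region_event(prev_regions, curr_regions, union_resembles):
    """prev_regions, curr_regions: region ids present in consecutive frames;
    union_resembles(vanished, survivor): True if the survivor's contour
    resembles the union of the two contours observed before the merge."""
    appeared = [r for r in curr_regions if r not in prev_regions]
    vanished = [r for r in prev_regions if r not in curr_regions]
    if appeared and not vanished:
        return "entry"                      # confirmed only after several stable frames
    if len(vanished) == 1 and len(curr_regions) == len(prev_regions) - 1:
        lost = vanished[0]
        if any(union_resembles(lost, kept) for kept in curr_regions):
            return "coupled"                # two objects now move as one rigid body
        return "exit_or_occlusion"          # resolved with first- and second-order spatial cues
    return "none"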

6. TASK VERIFICATION

Task verification is a primary requirement for safety monitoring. The idea is that the visual supervisor should not only identify the various objects in the field of view, but also establish that their interactions are safe. Here, the notion of safety is determined by the set of active tasks; the same interaction between some objects can have vastly different interpretations given the active tasks. For example, an observation that A and B are moving toward each other should be interpreted as a possible collision in the absence of an active ‘A joins B’ task.

Objects in the field of view determine which tasks are possible. In the absence of any knowledge about active tasks, the problem of inferring which task is active is both challenging and computationally expensive due to the large number of potential tasks that must be reasoned about; 2^n tasks, for example, are possible with n objects. The visual supervisor, therefore, restricts the reasoning about tasks to the list of active tasks, which can be regarded as a scheme for focusing attention on what is relevant.

The visual supervisor defines each task in terms of actions and actors. We can describe a task in BNF form as

<task>    ::= <task> <stage>
<stage>   ::= <stage> <action>
<action>  ::= <action> <actors>
<actors>  ::= <object> <actors>


It simply states that a task consists of multiple stages and that each stage defines the behaviors of, and interaction between, objects. This allows the visual supervisor to quickly construct a list of objects that should be in the field of view for a given task to be active. As a first step, this list is matched against the set of objects estimated to be in the field of view by the object reasoning module. The matching provides a quick way to rule out tasks that are not active and to identify objects that should not be present in the field of view.
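To show how such a grammar can drive the screening step just described, here is a possible encoding of tasks, stages, and actors as plain data structures, together with the object-list test; the class and function names are illustrative assumptions, and the ‘dock’ task of Section 7.4 would be one instance.

from dataclasses import dataclass
from typing import List, Set

@dataclass
class Action:
    name: str            # e.g. "converge", "coupled"
    actors: List[str]    # object labels taking part in the action

@dataclass
class Stage:
    name: str
    actions: List[Action]

    def required_objects(self) -> Set[str]:
        return {actor for action in self.actions for actor in action.actors}

@dataclass
class Task:
    name: str
    stages: List[Stage]

def candidate_stages(task: Task, observed: Set[str]) -> List[Stage]:
    # Keep only stages whose required objects are all believed to be in view.
    return [stage for stage in task.stages if stage.required_objects() <= observed]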

The matching is non-trivial due to the fact that multiple objects are usually estimated for each region of coherent motion. We propose a weighted matching scheme to address this issue, where the strength of an object determines the relevance of a match. Also, different stages of a task might involve different objects, so in general, it is not sufficient to match a task-wide list of objects. The visual supervisor instead matches the object list for each stage against the objects estimated to be in the field of view. This yields a set of possible stages that might be active which, in turn, gives a hint as to what task(s) might be active.

Stage verification is performed by observing the interaction of the objects. In the current prototype, we are able to identify the following interactions between two objects:

• Two objects are maintaining their distance.

• Two objects are converging.

• Two objects are diverging.

7. RESULTS

We demonstrate the framework on the footage of mission STS-87 as viewed from the Space Shuttle Columbia. The objective of this mission was to capture the Spartan satellite by using a robotic arm mounted on the Space Shuttle. An astronaut maneuvered the arm to bring it closer to, and eventually dock with, the satellite. This mission was aborted when the arm collided with the satellite, sending the latter into a spin.

7.1. Stage 1: Motion Segmentation and Object Tracking

We first processed the video to identify moving objects. For this video, the motion segmentation and object tracking module correctly identified two regions of coherent motion, with bounding contours shown in red and cyan (Figure 6).

7.2. Stage 2: Object Matching

The estimated regions are passed to the shape matching module that determines their identity and pose by matching the shock graph representation of a region’s bounding contour against a model database (Figure 7). Our model database contains multiple views of 20 different objects, including 22 views of the Spartan satellite and 4 views of the robotic arm shown in the video. The performance of the shape matching module, given the fact that the model database contains 633 unique views, is a testament to the robustness of our shape matching scheme.

Figure 7. The estimated regions, such as those for Frame 855 shown in (a) Region 1 and (b) Region 2, are passed on to the object matching module that returns a ranked list of object-pose hypotheses for each region. For Frame 855, the top five matches for Region 1 are ‘satellite’, ‘arm’, ‘arm’, ‘guitar’, and ‘guitar’, and for Region 2: ‘umbrella’, ‘satellite’, ‘satellite’, ‘umbrella’, and ‘satellite’. The correct match for Region 1 is ‘arm’ and for Region 2 is ‘satellite’. (c) The match matrix: the results of matching the estimated regions in every frame against the model database. Rows represent the regions in every frame (a total of 772 rows; 2 regions in 386 frames; we processed every fifth frame starting from Frame 25). Columns (633) represent the number of views in the model database. White indicates a high similarity value.

We created view-based models for the satellite (to be used by the shape matching routine) by rendering silhouette images of the 3D satellite CAD model from multiple views (Figure 8(a)) and then computing shock graphs from these images. It turns out that the 3D satellite CAD model has minor structural detail not visible on the actual object, which results in a shock graph that is noticeably different from a shock graph that is extracted from the bounding contour of the region corresponding to the actual satellite. This translates into poor performance of the shape matching routine. We resolve this issue by applying a Gaussian filter on the silhouette image, thereby bridging the representational gap between model and image.
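A minimal version of this smoothing step, assuming the rendered silhouette is a binary mask, is sketched below; the kernel width and the re-threshold value are illustrative choices, not the parameters used in our experiments.

import numpy as np
from scipy import ndimage

def smooth_silhouette(silhouette: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """silhouette: binary (H, W) mask rendered from the CAD model."""
    blurred = ndimage.gaussian_filter(silhouette.astype(float), sigma=sigma)
    return (blurred > 0.5).astype(np.uint8)   # back to a clean binary mask with fine detail suppressed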

We also created view-based models for the robotic arm shown in the test video. Here, due to the lack of a readily available geometric model of the arm, and given the fact that it is never fully visible in the video, we extracted a silhouette for the arm for every frame in the video and clustered them into 4 exemplar silhouettes through K-means clustering. We then computed shock graphs for these exemplars and added these to the model database.
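The exemplar selection can be sketched as follows, assuming the extracted silhouettes have been normalized to a common size; clustering raw pixels and picking the member nearest each cluster centre is our simplification, and a shape descriptor could be clustered instead.

import numpy as np
from sklearn.cluster import KMeans

def select_exemplars(silhouettes: np.ndarray, k: int = 4) -> list:
    """silhouettes: (N, H, W) binary masks of a common size; returns k exemplar indices."""
    features = silhouettes.reshape(len(silhouettes), -1).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    exemplars = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        exemplars.append(int(members[np.argmin(dists)]))   # silhouette closest to the cluster centre
    return exemplars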

7.3. Stage 3: Object Reasoning

The object reasoning module identifies the two regions from the object-view hypotheses returned from the shape matching module. It only considers hypotheses whose similarity value exceeds 0.4. The object reasoning module has correctly estimated Region 1 to be the satellite and Region 2 to be the robotic arm (Figure 9).

Figure 6. The motion tracking and segmentation routine successfully estimated two regions exhibiting coherent motion and computed their bounding contours in (a) Frame 25, (b) Frame 300, (c) Frame 855, and (d) Frame 1890.

Figure 8. (a) Silhouette image obtained by rendering the 3D satellite CAD model; notice the fine detail. (b) The same image after Gaussian smoothing.

7.4. Stage 4: Task Identification and Verification

In the current setup, the set of active tasks contains only one task: ‘dock’. During a ‘dock’ task, a robotic arm moves closer to the satellite until the latter is successfully captured. Upon a successful capture, the satellite and the arm are rigidly coupled and exhibit the same motion characteristics. Given this description of the ‘dock’ task, we can describe it as follows:

<dock>     ::= <approach> <captured>
<approach> ::= <converge> <sat> <arm>
<captured> ::= <coupled> <sat_arm> |
               <coupled> <sat> <arm>

The visual supervisor matches the objects identified in Stage 3 with the actors (or objects) required for the various stages of the active task and decides that both objects are expected to be in the field of view. In a more general setting, when one of the objects is not expected to be in the field of view, the visual supervisor can either raise an alert or employ a more suitable routine for further visual analysis.

The next step is to resolve the current stage of the task. In general, specialized routines are required to decide between different stages; however, we were able to use the relative motion of the two objects to choose the active stage. We label the relative motion as converging, diverging, or coupled by fitting a linear model to the tracking data and examining its slope. Two objects that appear to be converging in a video are not necessarily also coming closer in 3D, so the visual supervisor also estimates the relative depth of the two objects by comparing their image area to the model dimensions. Monocular depth estimation, while not as accurate as stereo depth estimation, is sufficient for our purpose. Also, the visual supervisor can potentially employ a better depth estimation strategy when the need arises.
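The slope test and the area-based depth cue can be sketched as follows; the slope threshold eps and the square-root area ratio are our own illustrative choices, not calibrated values from the prototype.

import numpy as np

def relative_motion(frame_numbers, distances, eps=1e-2):
    """Fit a line to the inter-object image distance over recent frames and read off its slope."""
    slope = np.polyfit(np.asarray(frame_numbers, float), np.asarray(distances, float), 1)[0]
    if slope < -eps:
        return "converging"
    if slope > eps:
        return "diverging"
    return "maintaining"

def relative_depth_ratio(image_area, model_area):
    """Coarse monocular depth cue: apparent linear scale relative to the model,
    which varies inversely with distance to the camera."""
    return np.sqrt(image_area / model_area)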

The visual supervisor estimated the current stage to be ‘approach’, as the two objects exhibited distinct motions. The visual supervisor also estimated that the two objects are converging, thereby deciding that the ‘approach’ stage is progressing satisfactorily.

Around Frame 855, the two regions are almost touching each other, so the visual supervisor estimates that the current stage is ‘captured’. Due to task failure, however, the satellite starts moving away from the robotic arm; the visual supervisor detects that the two objects are diverging from each other and estimates that the ‘captured’ stage is failing.

8. CONCLUSIONS

We have proposed a visual supervisor framework capable of constructing qualitative interpretations of a scene from monocular videos with the aim of safety monitoring. The framework brings together motion segmentation, object tracking, view-based shape matching, and a new spatio-temporal reasoning scheme within an agent control framework. The object reasoning module can correctly identify visual events, such as ‘entry’, ‘exit’, ‘occlusion’, and ‘coupled’. Our task description scheme allows task hypotheses to be generated from the identified objects. Low-level visual routines can then be used for task recognition and monitoring. We have demonstrated the framework on the footage from mission STS-87, and the initial results appear promising.

Figure 9. Voting results for possible candidates for Region 1 (a) and Region 2 (b); each panel plots object confidence against frame number. In (a), Region 1: Object Confidence, the candidates shown are the Spartan satellite, espresso cup, seahorse, lamp, and umbrella; in (b), Region 2: Object Confidence, the candidates shown are the robot arm, guitar, and bat. The object reasoning module correctly inferred Region 1 to be a satellite and Region 2 to be a robotic arm. We only show candidates whose vote at any point during the sequence exceeds 0.03.

We envision that visual supervisors, such as the one described here, will become more relevant as an increasing number of space missions are performed autonomously by intelligent controllers without a human in the loop. As such, the framework introduced here is a step in this direction.

There are a number of things that we want to address in the future. First, we want to evaluate the visual supervisor on monocular videos from multiple missions which, among other things, requires us to include view-based models of some other objects in the shape matching database. Next, we want to investigate more sophisticated task recognition and monitoring schemes. Most importantly, we want to close the loop between the reasoning module and the low-level visual routines, which would allow the reasoning module to guide the low-level visual routines, making them more competent and robust.

ACKNOWLEDGMENTS

The authors wish to gratefully acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and Communications and Information Technology Ontario (CITO). The authors would also like to acknowledge the earlier contribution of Neil Witcomb to the project.

REFERENCES

1. Sminchisescu, C., Metaxas, D., and Dickinson, S. Incremental model-based estimation using geometric constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):727–738, July 2005.

2. Jasiobedzki, P. and Anders, C. Computer vision for space robotics: Applications, role, and performance. In SPRO ’98 IFAC Workshop on Space Robotics, Canadian Space Agency, pages 96–103.

3. Qureshi, F. Z., Terzopoulos, D., and Gillett, R. The cognitive controller: A hybrid, deliberative/reactive control architecture for autonomous robots. In Orchard, B., Yang, C., and Ali, M., editors, Innovations in Applied Artificial Intelligence. 17th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE 2004), volume 3029 of Lecture Notes in Artificial Intelligence, pages 1102–1111, Ottawa, Canada, May 2004. Springer-Verlag.

4. Chung, D., Maclean, W., and Dickinson, S. Integrating region and boundary information for improved spatial coherence in object tracking. In Workshop on Articulated and Nonrigid Motion (CVPR), Washington, D.C., June 2004.

5. Siddiqi, K. and Kimia, B. B. A shock grammar for recognition. In IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, USA, 1996.

6. Shokoufandeh, A., Macrini, D., Dickinson, S., Siddiqi, K., and Zucker, S. Indexing hierarchical structures using graph spectra. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1125–1140, July 2005.

7. Siddiqi, K., Bouix, S., Tannenbaum, A., and Zucker, S. The Hamilton-Jacobi skeleton. In IEEE International Conference on Computer Vision, pages 828–834, 1999.

8. Macrini, D. Indexing and matching for view-based 3-D object recognition using shock graphs. Master’s thesis, University of Toronto, 2003.