
Where is the World Headed? Trajectory Prediction for Interacting Agents

Nitin Kamra¹, Hao Zhu², Dweep Trivedi¹, Ming Zhang², and Yan Liu¹

¹University of Southern California, CA, USA | ²Peking University, Beijing, China
{nkamra,dtrivedi,yanliu.cs}@usc.edu | {hzhu1998,mzhang_cs}@pku.edu.cn

Preprint. Under review.

Abstract

Trajectory prediction for scenes with multiple agents is an important problem for traffic prediction, pedestrian tracking and path planning. We present a novel relational neural network model to address this problem, which flexibly models interaction between agents by making fuzzy decisions and combining the corresponding responses with a fuzzy operator. Our approach shows significant performance gains over many existing state-of-the-art predictive models in diverse domains such as human crowd trajectories, US freeway traffic and physics datasets.

1 Introduction

Predicting trajectories of agents interacting with each other is a key challenge in many practical domains such as traffic prediction [17, 11], pedestrian tracking [1, 3] and path planning [13]. Modeling interactions is challenging because: (a) interaction is dynamically varying over time and it is hard to infer precisely when two agents are interacting, and (b) changes to trajectories resulting from interactions can be quite fuzzy. Since fuzzy decisions enter routinely into human interactions and are also useful to represent interactions in systems of inanimate entities, we present a novel Fuzzy Query Attention (FQA) mechanism to solve the aforementioned challenges.

Problem formulation: In all subsequent text, p denotes positions and v denotes velocities. Consider a scene comprising N agents with positions p^t_i = (x^t_i, y^t_i) ∀i ∈ [N] at a sequence of time-steps t ∈ [T]. The task is to observe all agents from time 1 to T_obs, infer their motion characteristics and ongoing interactions, and predict their positions for time-steps T_obs + 1 to T. We use p^t without a subscript for the positions of all agents at time t, while p_i without a superscript denotes the trajectory of agent i at all time-steps (similarly for velocity).
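For concreteness, a minimal sketch of how a scene can be laid out as tensors (the shapes and variable names below are ours, purely for illustration):

    import torch

    N, T = 5, 20                 # agents and total time-steps (illustrative)
    T_obs = 2 * T // 5           # observed prefix (the paper uses T_obs = 2T/5)
    p = torch.randn(T, N, 2)     # p[t, i] = (x, y) position of agent i at time t
    observed, targets = p[:T_obs], p[T_obs:]   # observe 1..T_obs, predict the rest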

2 Fuzzy Query Attention model

Basic approach: FQA generates pairwise attention to decide when two agents are interacting via dot-product based fuzzy decisions. It simultaneously learns how the agent under focus is affected by the influencing agent given the fuzzy decisions. We demonstrate significant gains over existing baselines on human crowds [1, 7], freeway traffic [6] and physics datasets [4, 10], thereby showing that FQA can model diverse kinds of interactions.

FQA architecture: Our architecture (Figure 1a) takes the spatial positions p^t_{i=1:N} of all agents at time t and predicts the next-step positions p̂^{t+1}_{i=1:N}. An LSTM module whose weights are shared by all agents is used to maintain the hidden state (h^{t−1}_i) containing the past trajectory information for the i-th agent. We predict the next position of agent i by predicting a first-order linear velocity offset (ṽ^t_i) augmented with a non-linear correction (Δv^t_i): p̂^{t+1}_i = p^t_i + ṽ^t_i + Δv^t_i. The ṽ^t_i term captures the linear velocity of the agent, while Δv^t_i captures the agent's own intentional deviations from linearity plus effects due to interactions with other agents. The intentional deviations are captured by a preliminary update to the LSTM's hidden state: h^{t−1}_i → h̃^t_i (in Figure 1a).


[Figure 1: FQA prediction architecture. (a) Overall architecture; (b) Interaction module; (c) FQA module.]

For modeling interactions, our interaction module (Figure 1b) generates edges between agents (E) and uses the FQA module to generate an attention vector (a) for each agent, which is used to update the LSTM state (h_i) and predict the velocity correction (Δv_i) for each agent. The FQA module (Figure 1c) generates n yes-no responses (V_{y,sr}, V_{n,sr}) from sender-receiver features between agent pairs, combines them with a fuzzy if-else according to n fuzzy decisions (D_sr) to model interactions between agents s and r, and aggregates the combinations to generate the required attention per agent.

3 Experiments

We report Root Mean Square Error between ground truth and predictions over all time steps on test sets of diverse benchmark datasets. FQA with n = 8 decisions outperforms state-of-the-art baselines by significant margins and accurately models diverse kinds of multi-agent interactions for trajectory prediction. Please refer to the appendix for detailed descriptions of FQA, all baselines, the datasets, experimental setup and more comprehensive results.

Table 1: Prediction RMSE for FQA, Social LSTM (SLSTM) and Graph Attention Networks (GAT)

Model        ETH-UCY          Collisions       NGsim            Charges
SLSTM        0.690 ± 0.013    0.211 ± 0.002    6.453 ± 0.153    0.485 ± 0.005
GAT          0.575 ± 0.007    0.237 ± 0.001    6.100 ± 0.063    0.524 ± 0.004
FQA (ours)   0.540 ± 0.006    0.176 ± 0.004    5.071 ± 0.186    0.409 ± 0.019

4 Conclusion

We have presented a novel Fuzzy Query Attention (FQA) mechanism which models pairwise attention between agents by learning to make fuzzy decisions. While motivated by the multi-head self-attention introduced in [15], the novelty of FQA lies in generalizing self-attention, which applies to a single entity, to pairwise attention which attends to an ordered pair (sender-receiver) of entities and captures the interaction effects of the sender on the receiver. Further, our simpler row-wise dot-product structure fits easily on a single GPU (with ≤ 12 GB memory), while still retaining the strong performance of the dot-product attention structure. We demonstrate significant gains over existing state-of-the-art baselines in diverse domains, thereby demonstrating the potential of FQA.


References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.
[2] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[3] Stefan Becker, Ronny Hug, Wolfgang Hübner, and Michael Arens. An evaluation of trajectory prediction approaches and notes on the TrajNet benchmark. arXiv preprint arXiv:1805.07663, 2018.
[4] Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. In International Conference on Learning Representations, 2017.
[5] Benjamin Coifman and Lizhe Li. A critical evaluation of the Next Generation Simulation (NGSIM) vehicle trajectory dataset. Transportation Research Part B: Methodological, 105(C):362–377, 2017.
[6] Nachiket Deo and Mohan M Trivedi. Convolutional social pooling for vehicle trajectory prediction. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1468–1476, 2018.
[7] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
[9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In International Conference on Machine Learning, pages 2693–2702, 2018.
[11] Namhoon Lee, Wongun Choi, Paul Vernaza, Christopher B Choy, Philip HS Torr, and Manmohan Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. In IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017.
[12] Yaguang Li, Chuizheng Meng, Cyrus Shahabi, and Yan Liu. Structure-informed graph auto-encoder for relational inference and simulation. In ICML Workshop on Learning and Reasoning with Graph-Structured Representations, 2019.
[13] Christoph Rösmann, Malte Oeljeklaus, Frank Hoffmann, and Torsten Bertram. Online trajectory prediction and planning for social robot navigation. In 2017 IEEE International Conference on Advanced Intelligent Mechatronics, pages 1255–1260. IEEE, 2017.
[14] Andrea Tacchetti, H Francis Song, Pedro AM Mediano, Vinicius Zambaldi, Neil C Rabinowitz, Thore Graepel, Matthew Botvinick, and Peter W Battaglia. Relational forward models for multi-agent learning. In International Conference on Learning Representations, 2019.
[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[16] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[17] Kota Yamaguchi, Alexander C Berg, Luis E Ortiz, and Tamara L Berg. Who are you with and where are you going? In IEEE Conference on Computer Vision and Pattern Recognition, pages 1345–1352. IEEE, 2011.


A Full description of FQA

Following previous work [3], we learn to predict offsets from the current position of each agent. However, we break the offset into a first-order linear velocity estimate and a correction term (see eq 1). This reflects the inductive bias that most inanimate objects and agents tend to travel with nearly constant velocity in the absence of external influence, since only the velocity correction term accounts for non-linearities and inter-agent interactions.

p̂^{t+1}_i = p^t_i + ṽ^t_i + Δv^t_i,  ∀i ∈ 1:N    (1)

(Linear velocity):  ṽ^t_i = p^t_i − p^{t−1}_i,  ∀i ∈ 1:N    (2)

(Agent's intentions):  h̃^t_i = LSTM(p^t_i, h^{t−1}_i),  ∀i ∈ 1:N    (3)

(Interactions):  h^t, Δv^t = InteractionModule(p^t, h̃^t)    (4)

The first-order linear estimate of velocity (ṽ^t_i) is made by a direct difference of agents' positions at consecutive time steps (eq 2). To compute the correction term (Δv^t_i), a preliminary update is first made to the LSTM's hidden state using the incoming observation for each agent. This preliminary update captures the deviations from linearity due to an agent's own intentional acceleration or retardation (eq 3). The intermediate hidden states h̃^t_i and the current positions of all agents are further used to infer the ongoing interactions between agents, aggregate their effects, and update the hidden state of each agent to h^t_i, while also computing the correction term for the agent's velocity via an interaction module (eq 4). Since computation in all sub-modules happens at time t, we drop the superscript t from here on.
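A minimal PyTorch sketch of eqs 1-3, with the interaction module of eq 4 passed in as a callable and sketched further below (class and variable names are ours; the hidden size is illustrative):

    import torch
    import torch.nn as nn

    class FQAStep(nn.Module):
        # One prediction step: p_hat = p + v_tilde + dv (eq 1).
        def __init__(self, hidden_size=32):
            super().__init__()
            self.lstm = nn.LSTMCell(2, hidden_size)  # weights shared across agents

        def forward(self, p_t, p_tm1, state, interaction_module):
            # p_t, p_tm1: (N, 2) positions at times t and t-1; state: LSTM (h, c)
            v_tilde = p_t - p_tm1                      # eq 2: first-order velocity
            h_tilde, c = self.lstm(p_t, state)         # eq 3: agent's own intentions
            h, dv = interaction_module(p_t, h_tilde)   # eq 4: interaction effects
            p_hat = p_t + v_tilde + dv                 # eq 1: next-step prediction
            return p_hat, (h, c)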

Interaction module: The interaction module (Figure 1b) first creates a graph by generating directed edges between all pairs of agents (ignoring self-edges). The edge set E, the positions and the states of all agents are used to compute an attention vector a_i for each agent, aggregating all its interactions with other agents via the Fuzzy Query Attention (FQA) module (eq 5). This aggregated attention, along with each agent's current position and intermediate hidden state, is processed by subsequent fully-connected layers to generate the updated state h_i (which is fed back into the LSTM) and the velocity correction Δv_i for each agent (eqs 6 and 7).

a = FQA(p, h̃, E)    (5)

h_i = FC2(ReLU(FC1(p_i, h̃_i, a_i))),  ∀i ∈ 1:N    (6)

Δv_i = FC4(ReLU(FC3(h_i))),  ∀i ∈ 1:N    (7)
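A sketch of eqs 5-7 under the FC1-FC4 output sizes reported in section C.5 (48, 32, 16 and 2); the FQA attention sub-module is taken as a given callable here and sketched in later listings:

    import torch
    import torch.nn as nn

    class InteractionModule(nn.Module):
        def __init__(self, fqa, hidden_size=32, attn_size=32):
            super().__init__()
            self.fqa = fqa  # FQA attention sub-module, sketched further below
            self.fc1 = nn.Linear(2 + hidden_size + attn_size, 48)
            self.fc2 = nn.Linear(48, hidden_size)  # updated agent state h_i
            self.fc3 = nn.Linear(hidden_size, 16)
            self.fc4 = nn.Linear(16, 2)            # velocity correction dv_i

        def forward(self, p, h_tilde, edges):
            a = self.fqa(p, h_tilde, edges)           # eq 5
            x = torch.cat([p, h_tilde, a], dim=-1)
            h = self.fc2(torch.relu(self.fc1(x)))     # eq 6
            dv = self.fc4(torch.relu(self.fc3(h)))    # eq 7
            return h, dv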

Fuzzy Query Attention: The FQA module treats each directed edge as a sender-receiver (s−r) pair of agents (Figure 1c). It generates independent features: p_s, p_r, h_s and h_r for the senders and receivers by replicating p and h along each edge, and also relative features: p_sr = p_s − p_r (relative displacement), h_sr = h_s − h_r (relative state), p̂_sr = p_sr/‖p_sr‖ (unit vector along p_sr) and ĥ_sr = h_sr/‖h_sr‖ (unit vector along h_sr). These features f_sr = {p_s, p_r, p_sr, p̂_sr, h_s, h_r, h_sr, ĥ_sr} are combined by single fully-connected layers to generate n keys K_sr ∈ R^{n×d} and queries Q_sr ∈ R^{n×d} of dimension d each for every s−r pair (eqs 8 and 9), which are then combined via a variant of dot-product attention to generate fuzzy decisions D_sr ∈ R^n (eq 10):

K_sr = FC5(f⊥_sr),  ∀(s, r) ∈ 1:N, s ≠ r    (8)

Q_sr = FC6(f⊥_sr),  ∀(s, r) ∈ 1:N, s ≠ r    (9)

D_sr = σ(K_sr ⋆ Q_sr + B) = σ(Σ_{dim=1} (K_sr ⊙ Q_sr) + B),  ∀(s, r) ∈ 1:N, s ≠ r    (10)

where ⊙ represents the element-wise product, B ∈ R^n is a learnable bias parameter, σ stands for the sigmoid activation function and ⊥ stands for the detach operator¹. As a consequence of this formulation, D_sr ∈ [0, 1]^n can be interpreted as a set of n fuzzy decisions capturing the interaction between agents s and r.

¹The detach operator acts as identity for the forward pass but prevents any gradients from propagating back through its operand. This allows us to learn feature representations only using responses, while the keys and queries make useful decisions from the learnt features.
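A sketch of the decision computation of eqs 8-10, including the detach of the footnote (the feature dimension and names are illustrative, not from the released code):

    import torch
    import torch.nn as nn

    n, d, f_dim = 8, 4, 16                 # decisions, key/query dim, feature dim
    fc5 = nn.Linear(f_dim, n * d)          # keys,    eq 8
    fc6 = nn.Linear(f_dim, n * d)          # queries, eq 9
    B = nn.Parameter(torch.zeros(n))       # learnable bias

    def fuzzy_decisions(f_sr):
        # f_sr: (E, f_dim), one feature row per directed edge (s, r)
        f = f_sr.detach()                  # no gradients flow back into features
        K = fc5(f).view(-1, n, d)
        Q = fc6(f).view(-1, n, d)
        return torch.sigmoid((K * Q).sum(dim=-1) + B)   # eq 10: row-wise dot product

    D_sr = fuzzy_decisions(torch.randn(5, f_dim))       # 5 edges -> (5, 8) in [0, 1]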


These can now be used to fuzzily select the receiving agent's response to the current state of the sending agent. For this, the sender-receiver features are parsed in parallel by two-layer neural networks (with the first layer having a ReLU activation) to generate yes-no responses V_{y,sr}, V_{n,sr} ∈ R^{n×d_v} corresponding to D_sr being 1 (yes) or 0 (no) respectively (eqs 11 and 12). Though all the s−r features can be used here, our preliminary experiments showed that including only a subset of features (h_s and p_sr) gave comparable results and led to considerable savings in the number of parameters, so we only use this subset of features to generate the yes-no responses. These responses are then combined using a fuzzy if-else according to decisions D_sr to generate the final responses V_sr ∈ R^{n×d_v} (eq 13):

V_{y,sr} = FC8(ReLU(FC7(p_sr, h_s))),  ∀(s, r) ∈ 1:N, s ≠ r    (11)

V_{n,sr} = FC10(ReLU(FC9(p_sr, h_s))),  ∀(s, r) ∈ 1:N, s ≠ r    (12)

(Fuzzy if-else):  V_sr = D_sr V_{y,sr} + (1 − D_sr) V_{n,sr},  ∀(s, r) ∈ 1:N, s ≠ r    (13)

The n final responses generated per agent pair (∈ R^{n×d_v}) are then concatenated (∈ R^{n d_v}) and the final responses from all senders are aggregated on the respective receivers by dimension-wise max-pooling to accumulate the effect of all interactions on the receiver agents (eqs 14 and 15). Since max-pooling loses information while aggregating by selecting only the maximum feature values amongst all senders, we pre-process the final responses to increase the dimensions and retain more information, followed by subsequent post-processing after aggregation to reduce the number of dimensions again (eqs 14 and 16):

V_{proc,sr} = FC11(concat(V_sr))    (14)

V_{proc,r} = maxpool_{s:(s,r)∈E} V_{proc,sr}    (15)

a_r = FC12(V_{proc,r}),  ∀r ∈ 1:N    (16)
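A sketch of eqs 11-16 for the incoming edges of a single receiver, using n = 8, d_v = 6 and the FC7-FC12 sizes from section C.5; the (p_sr, h_s) input width of 34 assumes the 32-dimensional state from section C.5:

    import torch
    import torch.nn as nn

    n, dv_dim = 8, 6                       # decisions, response dimension
    resp_y = nn.Sequential(nn.Linear(34, 33), nn.ReLU(), nn.Linear(33, n * dv_dim))
    resp_n = nn.Sequential(nn.Linear(34, 33), nn.ReLU(), nn.Linear(33, n * dv_dim))
    fc11 = nn.Linear(n * dv_dim, 32)       # pre-process before aggregation
    fc12 = nn.Linear(32, 32)               # post-process into attention a_r

    def receiver_attention(feat_sr, D_sr):
        # feat_sr: (S, 34) = concat(p_sr, h_s) from the S senders of one receiver
        # D_sr:    (S, n)  fuzzy decisions from eq 10
        Vy = resp_y(feat_sr).view(-1, n, dv_dim)   # eq 11: yes responses
        Vn = resp_n(feat_sr).view(-1, n, dv_dim)   # eq 12: no responses
        D = D_sr.unsqueeze(-1)
        V = D * Vy + (1 - D) * Vn                  # eq 13: fuzzy if-else
        V_proc = fc11(V.view(-1, n * dv_dim))      # eq 14: concat + expand
        pooled = V_proc.max(dim=0).values          # eq 15: max-pool over senders
        return fc12(pooled)                        # eq 16: attention a_r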

Strengths of FQA: While derived from the multi-head self-attention introduced in [15], the novelty of FQA lies in generalizing self-attention, which applies to a single entity, to pairwise attention which attends to an ordered pair (sender-receiver) of entities and captures the interaction effects of the sender on the receiver. We simultaneously add a learnable bias B to improve modeling power (explained below). Further, though the original matrix-dot-product structure of self-attention requires a large memory to fit even for regular batch sizes, e.g. 32, our simpler row-wise dot-product structure fits easily on a single GPU (≤ 12 GB memory) for all datasets, while still retaining the strong performance of the dot-product attention structure.

What kinds of decisions can FQA learn?: Since keys and queries are linear in the senders' and receivers' states and positions, the decision space of FQA contains many intuitive decisions for trajectory prediction, e.g.:

1. Proximity: FQA can potentially learn a key-query pair equal to p_sr each and the corresponding bias as −d_th²; then the decision D = σ(p_sr^T p_sr − d_th²) going to zero reflects whether agents s and r are closer than distance d_th (a small numeric sketch follows this list). Note that such decisions would not be possible without the learnable bias parameter B, hence having the bias makes FQA more flexible.

2. Approach: Since a part of the state h_i can learn to model the velocity of agents v_i internally, FQA can potentially learn a key-query pair of the form K_sr = v_sr, Q_sr = p_sr, B = 0 to model D = σ(v_sr^T p_sr + 0), which tends to 0 when the agents are directly approaching each other. While we do not force FQA to learn such human-interpretable decisions, our experiments show that the fuzzy decisions learnt by FQA are highly predictive of interactions between agents (section B.3).
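As a small numeric illustration of the proximity decision above (the threshold and displacement values are made up):

    import torch

    d_th = 0.5                                  # hypothetical proximity threshold
    p_sr = torch.tensor([0.3, 0.4])             # relative displacement, |p_sr| = 0.5
    D = torch.sigmoid(p_sr @ p_sr - d_th ** 2)  # sigmoid(0.25 - 0.25) = 0.5
    # shrinking |p_sr| below d_th drives D toward 0 ("agents are near"),
    # growing it above d_th drives D toward 1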

Training: FQA and all our other baselines detailed in the next section are trained to minimize the mean-square error in predicting next time-step positions of all agents. Since many datasets involve agents entering and exiting the scene freely between frames, we input binary masks to all models for each agent to determine the presence of that agent in the current frame, and control updates for that agent accordingly (masks not shown in Figure 1 to avoid clutter). All models are trained with the Adam optimizer [9] with a batch size of 32 and with an initial learning rate of 0.001,


decaying multiplicatively by a factor γ = 0.8 every 5 epochs. All models are allowed to train for at least 50 epochs, after which early stopping is enabled with a max patience of 10 epochs on validation set mean-square error, and training is terminated at a maximum of 100 epochs. Since our goal is to test the models by observing T_obs (which was kept at 2T/5 for all datasets) time-steps and making predictions for the remaining time until T, we followed a dynamic schedule allowing all models to see the real observations for T_temp time-steps followed by T − T_temp of their own last time-step predictions. During training, T_temp is initialized at T and linearly decayed by 1 every epoch until it becomes equal to T_obs. We found this dynamic burn-in schedule employed during training to improve the prediction performance of all models.
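A sketch of the optimizer schedule and dynamic burn-in described above (the training loop body and early-stopping logic are elided; the stand-in model is illustrative):

    import torch

    model = torch.nn.Linear(2, 2)   # stand-in for any of the predictors
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.8)

    T = 20
    T_obs = 2 * T // 5              # = 8 observed steps
    for epoch in range(100):
        T_temp = max(T - epoch, T_obs)   # dynamic burn-in: decay by 1 per epoch
        # feed ground truth for T_temp steps, then the model's own
        # last predictions for the remaining T - T_temp steps
        sched.step()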

B Experiments

We perform multi-agent trajectory prediction on different datasets used previously in the literature, with a diverse variety of interaction characteristics. For datasets with no provided splits, we follow a 70 : 15 : 15 split for training, validation and test set scenes.

1. ETH-UCY [3]: A human crowds dataset with medium interaction density. We sampled about 3400 scenes at random from the dataset and set T = 20 following prior work [1, 7].

2. Collisions: Synthetic physics data with balls moving on a frictionless 2D plane, fixed circular landmarks and boundary walls. The collisions between balls preserve momentum and energy, while collisions of agents with walls or immobile landmarks only preserve energy but not the momentum of moving agents. Contains about 9500 scenes with T = 25.

3. NGsim [5]: US-101 and I-80 freeway traffic data with fast-moving vehicles. Since this dataset features very high agent density per scene (ranging in the several thousands), we chunked the freeways with horizontal and vertical lines into sub-sections to restrict the number of vehicles in a sub-scene to fewer than 15. We sampled about 3500 sub-scenes from the resulting chunks and set T = 20.

4. Charges [10]: Physics data with positive and negative charges moving under other charges' electric fields and colliding with bounding walls. Contains 3600 scenes with T = 25 involving dense attractive and repulsive interactions.

We compare our FQA architecture with state-of-the-art baselines (see section C for architecture details and unique hyperparameters of all methods):

1. Vanilla LSTM [VLSTM]: An LSTM preceded and followed by fully-connected neural network layers is used to predict the offset without considering interactions.

2. Social LSTM [SLSTM] [1]: Recurrent architecture which models interactions by discretizing the space around each agent and aggregating neighbors' latent states via a social pooling mechanism.

3. GraphSAGE [GSAGE] [8]: Graph neural networks with node features to model interactions between agents. We use feature-wise max-pooling for aggregating the messages passed along the edges.

4. Graph Networks [GN] [2, 14]: Graph neural networks with node features, edge features and global features to model interactions between agents. We adapt the Encoder → RecurrentGN → Decoder architecture from [14].

5. Neural Relational Inference [NRI] [10]: Uses graph neural networks to model interactions between agents and additionally infers edges between agents using variational inference.

6. Graph Attention Networks [GAT] [16]: Follows an aggregation style similar to GraphSAGE, but weighs messages passed from all sender agents via a learnt attention mechanism.

To evaluate and compare all models, we report the Root Mean Square Error (RMSE) between the ground truth and our predictions over all predicted time steps for all agents on the test set of every dataset. The standard deviation is computed on the test set RMSE over five independent training runs of a method, differing only in their initial random seed.
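For reference, one way to compute this metric; the exact reduction order (mean over steps and agents before the square root) is our assumption, as the text does not spell it out:

    import torch

    def rmse(pred, truth):
        # pred, truth: (T_pred, N, 2) predicted and ground-truth positions
        return ((pred - truth) ** 2).sum(dim=-1).mean().sqrt()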

B.1 Prediction results

We present prediction results on all datasets for each model in Table 2. Our model with n = 8 decisions outperforms all the state-of-the-art baselines on all benchmark datasets (on many by significant margins).


Table 2: Prediction error metrics for all methods on all datasets

Model        ETH-UCY          Collisions       NGsim            Charges
VLSTM        0.576 ± 0.002    0.245 ± 0.001    5.972 ± 0.065    0.533 ± 0.001
SLSTM        0.690 ± 0.013    0.211 ± 0.002    6.453 ± 0.153    0.485 ± 0.005
NRI          0.778 ± 0.027    0.254 ± 0.002    7.491 ± 0.737    0.557 ± 0.008
GN           0.577 ± 0.014    0.234 ± 0.001    5.901 ± 0.238    0.508 ± 0.006
GSAGE        0.590 ± 0.011    0.238 ± 0.001    5.582 ± 0.082    0.522 ± 0.002
GAT          0.575 ± 0.007    0.237 ± 0.001    6.100 ± 0.063    0.524 ± 0.004
FQA (ours)   0.540 ± 0.006    0.176 ± 0.004    5.071 ± 0.186    0.409 ± 0.019

This shows that FQA can accurately model diverse kinds of interactions. Specifically, we observe that all models find it difficult to model the sparse interactions in the Collisions data, while FQA performs significantly better with lower errors, presumably due to its fuzzy decision-making capability letting it decide when two agents are interacting (more detail in section B.3). Further, even though GAT also uses an attention mechanism at the receiver agents to aggregate messages passed from sender agents, FQA outperforms GAT on all datasets, showing that FQA might have a better inductive bias towards modeling multi-agent interactions for trajectory prediction.

Further, as a side note, we point out that SLSTM [1] and NRI [10], both of which model interactions, are often outperformed by VLSTM, which does not model interactions. While surprising at first, we found that this has also been confirmed for SLSTM by prior works, namely Social GAN [7], which has common co-authors with SLSTM, and independently by the TrajNet Benchmark paper [3]. We believe that this is because both methods introduce significant noise in the neighborhood of entities: (a) SLSTM does this via its summation of neighbors' hidden states within discretized bins, which can potentially lose significant direction- and motion-specific information, and (b) NRI infers many spurious edges during its variational inference encoding process, as was also shown by [12].

B.2 Ablations

To show that it is indeed the fuzzy decision-based attention mechanism which lends our model its strength, we present several ablations of our model.

Removing velocity correction: We first remove the velocity correction term (Δv^t_i) from our predicted offset to show that this term, which models deviations from the linear velocity estimate, is indeed required for accurate prediction. We call this model FQA_lin, and Table 3 shows the stark deterioration in performance after the removal of the velocity correction.

Removing decision making of FQA: Next, we replaced the keys, queries and decisions in FQA by a sequence of equivalent fully-connected layers, so that responses V_sr are directly produced from input features without any fuzzy decisions, while retaining the subsequent concatenation and aggregation of responses. We call this variant FQA_NoDec and show the deterioration in performance from the loss of decision making in Table 3. It is clear that removing the decision making reduces FQA to the same or a worse level of performance as other baselines on most benchmark datasets, and demonstrates that the strength of FQA comes from its fuzzy decision-making process.

Table 3: Prediction error metrics with ablations and augmentations

Model        ETH-UCY          Collisions       NGsim            Charges
FQA_lin      0.576 ± 0.000    0.519 ± 0.000    6.159 ± 0.000    0.778 ± 0.000
FQA_NoDec    0.539 ± 0.006    0.234 ± 0.001    5.616 ± 0.163    0.505 ± 0.007
GN_dce       0.572 ± 0.020    0.227 ± 0.002    5.714 ± 0.155    0.451 ± 0.004
GSAGE_dce    0.579 ± 0.011    0.231 ± 0.001    5.901 ± 0.099    0.456 ± 0.005
GAT_dce      0.571 ± 0.006    0.232 ± 0.001    5.936 ± 0.124    0.460 ± 0.008
FQA_dce      0.532 ± 0.002    0.175 ± 0.004    5.814 ± 0.170    0.416 ± 0.001


B.3 Understanding fuzzy decisions of FQA

Distance-based cutoff for edges: To check whether FQA can learn fuzzy decisions reflecting proximity between agents, we replaced our edge generator with one producing edges with a distance-based cutoff, so it outputs a directed edge between agents s and r only if ‖p^t_s − p^t_r‖₂ ≤ d_thresh. The threshold d_thresh was found by a crude hyperparameter search and was set to d_thresh = 0.5 in the normalized coordinates provided to all models. We show prediction errors for FQA and other baselines, namely GN, GSAGE and GAT², by providing them distance-constrained edges instead of all edges (dce variants) in Table 3. While the dce variants of the baselines show improvements in prediction errors on most datasets, FQA only shows minor improvements on Collisions, which has a sparse density of interactions, while the performance degrades on the other datasets with dense interactions. This suggests that FQA is indeed able to model proximity between agents even from a fully-connected graph if the dataset is sufficiently dense in the number of interactions per time-step, and does not require aiding heuristics, while other baselines do not necessarily extract this information and hence benefit from the heuristic.
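A sketch of such a distance-constrained edge generator (the function name is ours; d_thresh = 0.5 per the text):

    import torch

    def dce_edges(p, d_thresh=0.5):
        # p: (N, 2) agent positions at time t
        n = p.shape[0]
        dist = torch.cdist(p, p)                     # pairwise Euclidean distances
        mask = (dist <= d_thresh) & ~torch.eye(n, dtype=torch.bool)
        return mask.nonzero(as_tuple=False)          # rows of (sender, receiver)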

Table 4: Predicting collisions from FQA decisions

τ            Accuracy    AUROC
1            95.55%      0.854
2            95.48%      0.866
3            95.35%      0.870
Recurrent    95.75%      0.907

Figure 2: Predicted trajectories from all models, shown with circles of radii increasing with time. The lighter shades show the observed part up to T_obs, while the darker shades show the predictions till T. (a) Collisions data: FQA models sparse interactions like inter-agent collisions well. (b) Collisions data: FQA models stationary fixed landmarks well (blue) and predicts sharp collisions with walls. (c) Charges data: complex swirling of opposite charges (see pink and orange trajectories) accompanied by high accelerations; no model except FQA is able to predict such complex motion.

Predicting interactions from decisions: To investigate whether the fuzzy decisions capture inter-agent interactions well, we present an experiment to predict when a collision happens between two agents on the Collisions dataset³ from only the 8 agent-pair decisions D^t_sr. Since collisions are sparse, we present the prediction accuracy and the area under the ROC curve on a held-out test set in Table 4 for various classifiers trained to predict collisions between agents using different horizons of time-steps (τ) of the input decisions. Note that we do not even use the agents' positions, velocities, or the FQA responses (V_sr) as inputs to the predictors.

²SLSTM already uses a neighborhood size of 0.5 for discretization, while NRI infers edges internally via variational inference.

³This is the only synthetic dataset for which the ground truth of interactions is available.


Yet, the decision trajectories alone are sufficient to predict collisions with surprisingly high accuracy and AUROC, which strongly indicates that FQA's decisions accurately capture inter-agent interactions.

Visualization: Next, we visualize the trajectories predicted by FQA and other baselines. Figures 2a and 2b show inter-agent collisions and collisions between agents and boundaries, respectively. Due to agents' small sizes, inter-agent collisions are sparse events, and only FQA learns to model them appropriately while the other baselines ignore them. Further, FQA models the trajectories of agents faithfully and all collisions sharply, while other baselines sometimes predict curved trajectories and premature soft collisions in empty space without any real interaction. We further observe from the pink and orange charges in Figure 2c that it is hard to model the chaotic swirling of nearby opposite charges, due to the high accelerations resulting from Coulombic forces, and that FQA comes closest to being an accurate model. More visualization examples are shown in section D.

C Model architectures and hyperparameters

We provide the hyperparameters of our baselines and those of FQA in this section. All our experiments were done on systems with Ubuntu 16.04, and all models were trained using either Nvidia Titan X or Nvidia GeForce GTX 1080 Ti GPUs. All code was written in Python 3.6 with neural network architectures defined and trained using PyTorch v1.0.0.

C.1 Vanilla LSTM

The Vanilla LSTM model embeds each p^t_i into a 32-dimensional embedding vector using a fully-connected layer with ReLU activation. This vector is fed along with the previous hidden states to an LSTM with state size 64, whose output is again processed by a fully-connected layer to generate the 2-dimensional offset for the next-step prediction.
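A minimal sketch of this baseline with the stated sizes (class and method names are ours):

    import torch
    import torch.nn as nn

    class VLSTM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(2, 32), nn.ReLU())
            self.lstm = nn.LSTMCell(32, 64)
            self.out = nn.Linear(64, 2)          # offset for next-step prediction

        def step(self, p_t, state):
            h, c = self.lstm(self.embed(p_t), state)
            return p_t + self.out(h), (h, c)     # predicted next position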

C.2 Social LSTM

We adapted the code from https://github.com/quancore/social-lstm which directly reproduces the original authors' model from [1]. We kept the initial embedding size as 20, the LSTM's hidden size as 40, the size of the discretization grid as 4 and the discretization neighborhood size as 0.5⁴.

C.3 Neural Relational Inference

We adapted the authors' official repository from https://github.com/ethanfetaya/NRI. The input dimension was kept as 2 for positional coordinates and the number of edge types as 3 (since setting it to 2 gave worse results). The encoder employed the MLP architecture with hidden layers of size 32 and no dropout, while the GRU-based RNN architecture was used for the decoder with a hidden state of size 32 and no dropout. The variance of the output distribution was set to 5 × 10⁻⁵.

C.4 Graph Networks

While the original repository for Graph Networks is written in TensorFlow (https://github.com/deepmind/graph_nets), we translated the repository into PyTorch and adapted models similar to those employed by [14, 2]. We employed a vertex-level encoder, followed by a recurrent Graph Network based on GRU-style recurrence, followed by a Graph Net decoder. The vertex-level encoder transforms the 2-dimensional positional input at each time step into a 10-dimensional node embedding. An input graph is constructed from these node embeddings having all pairwise edges and dimensions 10, 1 and 1 respectively for the node, edge and global attributes. This input graph, along with a previous state graph (with dimensions 45, 8 and 8 for node, edge and global state attributes), was processed using a GRU-style recurrent Graph Network to output the updated state graph of the same dimensions (45, 8 and 8 respectively). This new state graph was processed by a feedforward graph network as prescribed in [2] to output another graph whose node features of dimension 2 were treated as offsets for the next time-step prediction.

⁴This neighborhood size is also the same as the distance cutoff used in section B.3.


All update networks, both in the encoder and the decoder (for node, edge and global features), used two feedforward layers with the intermediate layer having latent dimension 32 and a ReLU activation. While the original work proposes to use sum as the aggregation operator, we found summing to often cause the training to diverge, since different agents have neighborhoods of very diverse sizes, ranging from 0 to about 40 at different times in many of our datasets. Hence we used feature-wise mean-pooling for all aggregation operators.

C.5 GraphSAGE, Graph Attention Networks and Fuzzy Query Attention

Since GraphSAGE (GSAGE) [8] and Graph Attention Networks (GAT) [16] were not originally prescribed for a multi-agent trajectory prediction application, we used their update and aggregation styles in our own FQA framework to replace the FQA sub-module in our interaction module described in section 2. For all three methods the input size and the output size were 2, while the hidden state dimension of the LSTM shared by all agents was 32. The dimension of the aggregated attention a^t_i for each agent was also set to 32 for all three methods. All three methods involved the FC1, FC2, FC3 and FC4 layers described in section 2 and had output sizes 48, 32, 16 and 2 respectively.

GSAGE: GraphSAGE [8] directly embeds all sender latent vectors h_s into 32-dimensional embeddings via two fully-connected layers, each with a ReLU activation and with the intermediate layer of dimension 32. The output embeddings were aggregated into the receiver nodes via feature-wise max-pooling to generate a^t_i.

GAT: GAT performs a similar embedding of sender hidden states using an embedding network similar to GSAGE, but aggregates them via feature-wise max-pooling after weighing the embeddings with 8 attention-head coefficients generated as proposed in [16], and finally averages over the 8 aggregations. We used 8 attention heads to match the number of FQA's decisions.

FQA: FQA used 8 query-key pairs for all datasets, leading to 8 fuzzy decisions. The dimension of the keys and queries was set to 4, while the dimension of the yes-no responses was kept as 6. Consequently, the dimension of the learnt bias vector B was also 8, and the sizes of the fully-connected layers FC5, FC6, FC7, FC8, FC9, FC10, FC11 and FC12 described in the main text were 32, 32, 33, 48, 33, 48, 32 and 32 respectively.
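For reference, the same hyperparameters collected as a plain config dict (the key names are ours):

    fqa_config = {
        "num_decisions": 8,     # n: query-key pairs / fuzzy decisions
        "key_query_dim": 4,     # d
        "response_dim": 6,      # d_v: yes-no response dimension
        "bias_dim": 8,          # learnable bias B in R^n
        "fc_sizes": {"FC5": 32, "FC6": 32, "FC7": 33, "FC8": 48,
                     "FC9": 33, "FC10": 48, "FC11": 32, "FC12": 32},
    }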

D Additional visualizations

Here we show additional visualizations from all models on all datasets. The visualizations clearly demonstrate the strong inductive bias of FQA for trajectory prediction.

Figure 3: Predicted trajectory visualization from various models on Charges dataset.


Figure 4: Predicted trajectory visualization from various models on ETH-UCY dataset.

Figure 5: Predicted trajectory visualization from various models on Collisions dataset.

Figure 6: Predicted trajectory visualization from various models on NGsim dataset.
