
IONet: Learning to Cure the Curse of Drift in Inertial Odometry

Changhao Chen, Xiaoxuan Lu, Andrew Markham, Niki Trigoni
Department of Computer Science, University of Oxford, United Kingdom

Email: {firstname.lastname}@cs.ox.ac.uk

Abstract

Inertial sensors play a pivotal role in indoor localization, which in turn lays the foundation for pervasive personal applications. However, low-cost inertial sensors, as commonly found in smartphones, are plagued by bias and noise, which leads to unbounded growth in error when accelerations are double integrated to obtain displacement. Small errors in state estimation propagate to make odometry virtually unusable in a matter of seconds. We propose to break the cycle of continuous integration, and instead segment inertial data into independent windows. The challenge becomes estimating the latent states of each window, such as velocity and orientation, as these are not directly observable from sensor data. We demonstrate how to formulate this as an optimization problem, and show how deep recurrent neural networks can yield highly accurate trajectories, outperforming state-of-the-art shallow techniques, on a wide range of tests and attachments. In particular, we demonstrate that IONet can generalize to estimate odometry for non-periodic motion, such as a shopping trolley or baby stroller, an extremely challenging task for existing techniques.

Fast and accurate indoor localization is a fundamental need for many personal applications, including smart retail, public-places navigation, human-robot interaction and augmented reality. One of the most promising approaches is to use inertial sensors to perform dead reckoning, which has attracted great attention from both academia and industry because of its superior mobility and flexibility (Lymberopoulos et al. 2015).

Recent advances in MEMS (micro-electro-mechanical systems) technology have made inertial measurement units (IMUs) small and cheap enough to be deployed on smartphones. However, the low-cost inertial sensors on smartphones are plagued by high sensor noise, leading to unbounded system drift. Traditional strapdown inertial navigation systems (SINS), based on Newtonian mechanics, integrate IMU measurements directly. They are hard to realize on accuracy-limited IMUs due to exponential error propagation through integration. To address these problems, step-based pedestrian dead reckoning (PDR) has been proposed.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Overview of our proposed learning-based method (inputs: IMU data sequence; deep neural networks; outputs: full trajectory)

This approach estimates trajectories by detecting steps, estimating step length and heading, and updating locations per step (Li et al. 2012). Instead of double integrating accelerations into locations, a step-length update reduces exponentially increasing drift to linearly increasing drift. However, dynamic step estimation is heavily influenced by sensor noise, the user's walking habits and phone attachment changes, introducing unavoidable errors into the entire system (Brajdic and Harle 2013). In some scenarios, no steps can be detected at all: if a phone is placed on a baby stroller or shopping trolley, the assumption of periodicity exploited by step-based PDR breaks down. Therefore, the intrinsic problems of SINS and PDR prevent widespread use of inertial localization in daily life. The architecture of the two existing methods is illustrated in Figure 2.
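The per-step update at the heart of PDR can be sketched as follows (a minimal illustration, not the authors' implementation; the step detector and step-length model, which do the hard work, are assumed to exist elsewhere):

```python
import math

def pdr_update(x, y, heading, step_length):
    """Advance the position by one detected step along the estimated heading."""
    return (x + step_length * math.cos(heading),
            y + step_length * math.sin(heading))

# Three 0.7 m steps heading due east (heading = 0 rad):
x, y = 0.0, 0.0
for _ in range(3):
    x, y = pdr_update(x, y, heading=0.0, step_length=0.7)
```

Because the position grows by one bounded step at a time, an error in step length or heading accumulates linearly rather than exponentially, which is exactly the mitigation described above.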

To cure the unavoidable 'curse' of inertial system drift, we break the cycle of continuous error propagation and reformulate inertial tracking as a sequential learning problem. Instead of developing multiple modules for step-based PDR, our model can provide continuous trajectories for indoor users from raw data, without the need for any hand-engineering, as shown in Figure 1. Our contributions are three-fold:

• We cast the inertial tracking problem as a sequential learning approach by deriving a sequence-based physical model from Newtonian mechanics.

• We propose the first deep neural network (DNN) framework that learns location transforms in polar coordinates from raw IMU data, and constructs inertial odometry regardless of IMU attachment.

arXiv:1802.02209v1 [cs.RO] 30 Jan 2018


• We collected a large dataset for training and testing, and conducted extensive experiments across different attachments, users/devices and a new environment, whose results outperform traditional SINS and PDR mechanisms. In addition, we demonstrate that our model can generalize to more general motion without regular periodicity, e.g. a trolley or other wheeled configurations.

Related Work

In this section, we provide a brief overview of related work in inertial navigation systems, pedestrian dead reckoning (PDR) and sequential deep learning.

Strapdown Inertial Navigation System: Strapdown inertial navigation systems (SINS) have been studied for decades (Savage 1998). Previous inertial systems relied heavily on expensive, heavy, high-precision inertial measurement units, hence their main applications were constrained to moving vehicles, such as automobiles, ships, aircraft, submarines and spacecraft. Recent advances in MEMS technology enable low-cost MEMS IMUs to be deployed on robots, UAVs (Bloesch et al. 2015) and mobile devices (Lymberopoulos et al. 2015). However, restricted by size and cost, the accuracy of a MEMS IMU is extremely limited, and it has to be integrated with other sensors, as in visual-inertial odometry (Leutenegger et al. 2015). Another solution is to attach an IMU to the user's foot in order to take advantage of heel strikes for zero-velocity updates that compensate for system error drift (Skog et al. 2010). These inconveniences prevent wide adoption of inertial solutions on consumer-grade devices (Harle 2013).

Pedestrian Dead Reckoning: Unlike SINS's open-loop integration of inertial sensors, PDR uses inertial measurements to detect steps and to estimate stride length and heading via empirical formulas (Shu et al. 2015). System errors still accumulate quickly, because of incorrect step-displacement segmentation and inaccurate stride estimation. In addition, a large number of parameters have to be carefully tuned according to a user's walking habits. Recent research has mainly focused on fusing PDR with external references, such as a floor plan (Xiao et al. 2014a), WiFi fingerprinting (Hilsenbeck et al. 2014) or ambient magnetic fields (Wang et al. 2016), still leaving the fundamental problem of rapid drift unsolved. Compared with prior work, we abandon the step-based approach and present a new general framework for inertial odometry. This allows us to handle more general tracking problems, including trolley/wheeled configurations, which step-based PDR cannot address.

Sequential Deep Learning: Deep learning approaches have recently shown excellent performance in handling sequential data, such as speech recognition (Graves and Jaitly 2014), machine translation (Dai and Le 2015), visual tracking (Ondruska and Posner 2016) and video description (Donahue et al. 2015). To the best of our knowledge, our IONet is the first neural network framework to achieve inertial odometry using inertial data only. Previous learning-based work has tackled localization problems by employing visual odometry (Zhou et al. 2017; Wang et al. 2017; Clark et al. 2017a) and visual-inertial odometry (Clark et al. 2017b). Some other work has concentrated on learning intuitive physics (Hooman et al. 2017), modeling state-space models (Karl et al. 2016), and supervising neural networks via physics knowledge (Stewart and Ermon 2017). While most of these use visual observations, our work exploits real-world sensor measurements to learn high-level motion trajectories.

Figure 2: Architecture of existing methods: SINS and PDR (PDR pipeline: accelerometer and gyroscope → step detector, heading estimation, step length estimation with a personal step model → position update; SINS pipeline: accelerometers and gyroscopes → successive integrations → orientation, velocity and position)

The Curse of Inertial Tracking

The principles of inertial navigation are based on Newtonian mechanics. They allow tracking the position and orientation of an object in a navigation frame given an initial pose and measurements from accelerometers and gyroscopes.

Figure 2 illustrates the basic mechanism of inertial navigation algorithms. The three-axis gyroscope measures angular velocities of the body frame with respect to the navigation frame, which are integrated into attitude in Equations (1-3). To represent the orientation, the direction cosine matrix $C^n_b$ is used to represent the transformation from the body (b) frame to the navigation (n) frame, and is updated with a relative rotation matrix $\Omega(t)$. The three-axis accelerometer measures proper acceleration vectors in the body frame, which are first transformed to the navigation frame and then integrated into velocity, discarding the contribution of gravity $\mathbf{g}$ in Equation (4). The locations are updated by integrating velocity in Equation (5). Equations (1-5) describe the attitude, velocity and location update at any timestamp $t$. In our application scenarios, the effects of earth rotation and Coriolis accelerations are ignored.

Attitude Update:

$$C^n_b(t) = C^n_b(t-1)\,\Omega(t) \quad (1)$$

$$\boldsymbol{\sigma} = \mathbf{w}(t)\,dt \quad (2)$$

$$\Omega(t) = C^{b_{t-1}}_{b_t} = \mathbf{I} + \frac{\sin\sigma}{\sigma}[\boldsymbol{\sigma}\times] + \frac{1-\cos\sigma}{\sigma^2}[\boldsymbol{\sigma}\times]^2 \quad (3)$$

Velocity Update:

$$\mathbf{v}(t) = \mathbf{v}(t-1) + \left(C^n_b(t-1)\,\mathbf{a}(t) - \mathbf{g}^n\right)dt \quad (4)$$


Location Update:

$$\mathbf{L}(t) = \mathbf{L}(t-1) + \mathbf{v}(t-1)\,dt \quad (5)$$

where $\mathbf{a}$ and $\mathbf{w}$ are the accelerations and angular velocities in the body frame measured by the IMU, $\mathbf{v}$ and $\mathbf{L}$ are velocities and locations in the navigation frame, and $\mathbf{g}$ is gravity.
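Equations (1)-(5) can be sketched as a single NumPy update step (an illustrative reimplementation, not the authors' code; gravity is assumed to be $(0, 0, 9.81)$ in the navigation frame):

```python
import numpy as np

def skew(s):
    """Skew-symmetric matrix [s×] of a 3-vector."""
    return np.array([[0.0, -s[2], s[1]],
                     [s[2], 0.0, -s[0]],
                     [-s[1], s[0], 0.0]])

def sins_step(C, v, L, a, w, dt, g=np.array([0.0, 0.0, 9.81])):
    """One strapdown update following Equations (1)-(5)."""
    sigma = w * dt                          # Eq. (2)
    n = np.linalg.norm(sigma)
    S = skew(sigma)
    if n > 1e-12:                           # Eq. (3), Rodrigues formula
        Omega = np.eye(3) + np.sin(n) / n * S + (1 - np.cos(n)) / n**2 * (S @ S)
    else:                                   # small-angle limit
        Omega = np.eye(3) + S
    v_new = v + (C @ a - g) * dt            # Eq. (4), uses C(t-1)
    L_new = L + v * dt                      # Eq. (5), uses v(t-1)
    C_new = C @ Omega                       # Eq. (1)
    return C_new, v_new, L_new
```

For a stationary device whose accelerometer reads only gravity, velocity and location stay at zero; any bias in $\mathbf{a}$ or $\mathbf{w}$, however, is fed through this open-loop chain of integrations, which is the error explosion discussed below.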

Under ideal conditions, SINS sensors and algorithms can estimate system states for all future times. High-precision INS in military applications (aviation and marine/submarine) uses highly accurate and costly sensors to keep measurement errors very small, and requires a time-consuming system initialization, including sensor calibration and orientation initialization. However, these requirements are inappropriate for pedestrian tracking. Realizing a SINS mechanism on a low-cost MEMS IMU platform suffers from the following two problems:

• The measurements from IMUs embedded in consumer phones are corrupted by various error sources, such as scale factor, axis misalignment, thermo-mechanical white noise and random walk noise (Naser 2008). From attitude update to location update, the INS algorithm performs a triple integration from raw data to locations. Even tiny noise is greatly amplified through this open-loop integration, causing the whole system to collapse within seconds.

• A time-consuming initialization process is not suitable for everyday usage, especially orientation initialization. Even small orientation errors lead to an incorrect projection of the gravity vector. For example, a 1-degree attitude error causes an additional 0.1712 m/s² acceleration on the horizontal plane, leading to a 1.7 m/s velocity error and an 8.56 m location error within 10 seconds.
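The figures in this example follow directly from projecting gravity through the tilt error (taking $g = 9.81\ \mathrm{m/s^2}$):

```python
import math

g = 9.81                                  # gravity, m/s^2
a_err = g * math.sin(math.radians(1.0))   # horizontal acceleration error: ~0.1712 m/s^2
v_err = a_err * 10.0                      # velocity error after 10 s: ~1.71 m/s
L_err = 0.5 * a_err * 10.0 ** 2           # location error after 10 s: ~8.56 m
```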

Even if we view this physical model as a state-space model and apply a deep neural network directly to model inertial motion, the intrinsic problems would still 'curse' the entire system.

Tracking Down A Cure

To address the problems of error propagation, our novel insight is to break the cycle of continuous integration and segment inertial data into independent windows. This is analogous to resetting an integrator to prevent windup in classical control theory (Hippe 2006).

However, windowed inertial data is not independent, as Equations (1-5) clearly demonstrate. This is because key states (namely attitude, velocity and location) are unobservable: they have to be derived from previous system states and inertial measurements, and propagated across time. Unfortunately, errors are also propagated across time, cursing inertial odometry. It is clearly impossible for windows to be truly independent. However, we can aim for pseudo-independence, where we estimate the change in navigation state over each window. Our problem then becomes how to constrain or estimate these unobservable states over a window. Following this idea, we derive a novel sequence-based physical model from the basic Newtonian laws of motion, and reformulate it into a learning model.

The unobservable or latent system states of an inertial system consist of the orientation $C^n_b$, velocity $\mathbf{v}$ and position $\mathbf{L}$. In a traditional model, the transformation of system states can be expressed as a transfer function/state-space model between two time frames, as in Equation (6), where the system states are directly coupled with each other:

$$[C^n_b\ \mathbf{v}\ \mathbf{L}]_t = f([C^n_b\ \mathbf{v}\ \mathbf{L}]_{t-1}, [\mathbf{a}\ \mathbf{w}]_t) \quad (6)$$

We first consider displacement. To separate the displacement of a window from the prior window, we compute the change in displacement $\Delta\mathbf{L}$ over an independent window of $n$ time samples, which is simply:

$$\Delta\mathbf{L} = \int_{t=0}^{n-1} \mathbf{v}(t)\,dt \quad (7)$$

We can separate this into a contribution from the initial velocity state and the accelerations in the navigation frame:

$$\Delta\mathbf{L} = n\mathbf{v}(0)\,dt + [(n-1)\mathbf{s}_1 + (n-2)\mathbf{s}_2 + \cdots + \mathbf{s}_{n-1}]\,dt^2 \quad (8)$$

where

$$\mathbf{s}(t) = C^n_b(t-1)\,\mathbf{a}(t) - \mathbf{g} \quad (9)$$

is the acceleration in the navigation frame, comprising a dynamic part and a constant part due to gravity.

Then, Equation (8) is further formulated as:

$$\Delta\mathbf{L} = n\mathbf{v}(0)\,dt + \Big[(n-1)C^n_b(0)\,\mathbf{a}_1 + (n-2)C^n_b(0)\Omega(1)\,\mathbf{a}_2 + \cdots + C^n_b(0)\prod_{i=1}^{n-2}\Omega(i)\,\mathbf{a}_{n-1}\Big]dt^2 - \frac{n(n-1)}{2}\mathbf{g}\,dt^2 \quad (10)$$

and simplified to become:

$$\Delta\mathbf{L} = n\mathbf{v}(0)\,dt + C^n_b(0)\,\mathbf{T}\,dt^2 - \frac{n(n-1)}{2}\mathbf{g}\,dt^2 \quad (11)$$

where

$$\mathbf{T} = (n-1)\mathbf{a}_1 + (n-2)\Omega(1)\mathbf{a}_2 + \cdots + \prod_{i=1}^{n-2}\Omega(i)\,\mathbf{a}_{n-1} \quad (12)$$
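Equation (8) (and hence (11)) can be checked numerically against the step-by-step integration of Equations (4-5). The sketch below works directly with synthetic navigation-frame accelerations $\mathbf{s}_t$, sidestepping the rotations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dt = 200, 0.01
v0 = rng.normal(size=2)            # initial velocity v(0), navigation frame
s = rng.normal(size=(n - 1, 2))    # s[t-1] holds s_t for t = 1..n-1 (synthetic)

# Step-by-step integration: dL = sum of v(t) dt for t = 0..n-1 (Equation (7))
v, dL = v0.copy(), np.zeros(2)
for t in range(n):
    dL += v * dt
    if t < n - 1:
        v = v + s[t] * dt          # v(t+1) = v(t) + s_{t+1} dt

# Closed form of Equation (8): n v(0) dt + [(n-1)s_1 + ... + s_{n-1}] dt^2
weights = np.arange(n - 1, 0, -1)  # n-1, n-2, ..., 1
dL_window = n * v0 * dt + (weights[:, None] * s).sum(axis=0) * dt ** 2

assert np.allclose(dL, dL_window)
```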

In our work, we consider the problem of indoor positioning, i.e. tracking objects and people on a horizontal plane. This introduces a key observation: in the navigation frame, there is no long-term change in height¹. The mean displacement in the z axis over a window is assumed to be zero and can thus be removed from the formulation. We can compute the absolute change in distance over a window as the L2 norm, i.e. $\Delta l = \|\Delta\mathbf{L}\|_2$, effectively decoupling the distance traveled from the orientation (e.g. heading angle), leading to:

$$\Delta l = \Big\|n\mathbf{v}(0)\,dt + C^n_b(0)\,\mathbf{T}\,dt^2 - \frac{n(n-1)}{2}\mathbf{g}\,dt^2\Big\|_2 = \Big\|C^n_b(0)\Big(n\mathbf{v}^b(0)\,dt + \mathbf{T}\,dt^2 - \frac{n(n-1)}{2}\mathbf{g}^b_0\,dt^2\Big)\Big\|_2 \quad (13)$$

¹This assumption can be relaxed through the use of additional sensor modalities, such as a barometer, to detect changes in floor level due to stairs or an elevator.


Because the rotation matrix $C^n_b(0)$ is an orthogonal matrix, i.e. $C^n_b(0)^T C^n_b(0) = \mathbf{I}$, the initial unknown orientation has been successfully removed, giving us:

$$\Delta l = \|\Delta\mathbf{L}\|_2 = \Big\|n\mathbf{v}^b(0)\,dt + \mathbf{T}\,dt^2 - \frac{n(n-1)}{2}\mathbf{g}^b_0\,dt^2\Big\|_2 \quad (14)$$
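The step that makes Equation (14) possible, that multiplying by the orthogonal $C^n_b(0)$ leaves the L2 norm unchanged, is easy to verify with any rotation matrix (here a rotation about the z axis, chosen arbitrarily):

```python
import math
import numpy as np

theta = 0.7  # an arbitrary rotation angle
C = np.array([[math.cos(theta), -math.sin(theta), 0.0],
              [math.sin(theta),  math.cos(theta), 0.0],
              [0.0,              0.0,             1.0]])
x = np.array([1.0, -2.0, 0.5])

# C is orthogonal: C^T C = I, hence ||C x||_2 = ||x||_2
assert np.allclose(C.T @ C, np.eye(3))
assert np.isclose(np.linalg.norm(C @ x), np.linalg.norm(x))
```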

Hence, over a window, the horizontal distance traveled can be expressed as a function of the initial velocity, the gravity, and the linear accelerations and angular rates, all in the body frame:

$$\Delta l = f(\mathbf{v}^b(0), \mathbf{g}^b_0, \mathbf{a}_{1:n}, \mathbf{w}_{1:n}) \quad (15)$$

To determine the change in the user's heading, we note that a user's real accelerations and angular rates $(\mathbf{a}_{1:n}, \mathbf{w}_{1:n})$ are also latent variables of the raw IMU measurements $(\hat{\mathbf{a}}_{1:n}, \hat{\mathbf{w}}_{1:n})$, and that on the horizontal plane only the heading attitude is essential in our system. From Equations (2-3), the change in heading $\Delta\psi$ can be expressed as a function of the raw data sequence. Therefore, we succeed in reformulating the traditional model as one based on the polar vector $(\Delta l, \Delta\psi)$, which depends only on the inertial sensor data, the initial velocity and the gravity in the body frame:

$$(\Delta l, \Delta\psi) = f_\theta(\mathbf{v}^b(0), \mathbf{g}^b_0, \hat{\mathbf{a}}_{1:n}, \hat{\mathbf{w}}_{1:n}) \quad (16)$$

To derive a global location, given the starting location $(x_0, y_0)$ and heading $\psi_0$ of a window, the Cartesian projection can be written as:

$$x_n = x_0 + \Delta l\cos(\psi_0 + \Delta\psi), \qquad y_n = y_0 + \Delta l\sin(\psi_0 + \Delta\psi) \quad (17)$$
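Applying Equation (17) window after window composes the polar vectors into a trajectory. A minimal sketch (assuming, as a reading of the text, that the heading accumulates across windows):

```python
import math

def integrate_polar(x0, y0, psi0, deltas):
    """Chain per-window polar vectors (dl, dpsi) into a 2-D trajectory."""
    x, y, psi = x0, y0, psi0
    traj = [(x, y)]
    for dl, dpsi in deltas:
        x += dl * math.cos(psi + dpsi)   # Equation (17)
        y += dl * math.sin(psi + dpsi)
        psi += dpsi                      # carry the heading into the next window
        traj.append((x, y))
    return traj

# Two 1 m windows straight ahead, then a 1 m window after a 90-degree left turn:
traj = integrate_polar(0.0, 0.0, 0.0, [(1.0, 0.0), (1.0, 0.0), (1.0, math.pi / 2)])
```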

Our task now becomes how to implicitly estimate the initial velocity and the gravity in the body frame, by casting each window as a sequence learning problem.

This section serves as an overview of the transition from the traditional model-based method to the proposed neural-network-based method. It takes the traditional state-space model described in Equations (1-5), which converts raw data to poses in a step-by-step manner, to a formulation where a window of raw inertial data is processed in a batch to estimate a displacement and an angle change. Note that in both formulations, the final output depends on the initial attitude and velocity. As a result, in both cases, the curse of error accumulation cannot be avoided if using the model-based integration approach. However, our sequence-based formulation paves the way for our proposed neural network approach.

Deep Neural Network Framework

Estimating the initial velocity and the gravity in the body frame explicitly using traditional techniques is an extremely challenging problem. Rather than trying to determine the two terms, we instead treat Equation (16) as a sequence, where the inputs are the observed sensor data and the output is the polar vector. The unobservable terms simply become latent states of the estimation. Intuitively, the motivation for this relies on the regular and constrained nature of pedestrian motion. Over a window, which could be a few seconds long, a person walking at a certain rate induces a roughly sinusoidal acceleration pattern. The frequency of this sinusoid relates to the walking speed. In addition, biomechanical measurements of human motion show that as people walk faster, their strides lengthen (Hausdorff 2007). Moreover, the gravity in the body frame is related to the initial yaw and roll angles, determined by the attachment/placement of the device, which can be estimated from the raw data (Xiao et al. 2015). In this paper, we propose the use of deep neural networks to learn the relationship between raw acceleration data and the polar delta vector, as illustrated in Figure 3.

Figure 3: Overview of IONet framework (an IMU sequence of 1×6 frames is segmented into windows, each mapped by deep neural networks to a 1×2 polar vector (∆l, ∆ψ); the resulting pose sequence is composed into a full trajectory over time)

Input data are independent windows of consecutive IMU measurements, which are strongly temporally dependent and represent body motions. To recover latent connections between motion characteristics and data features, a deep recurrent neural network (RNN) is capable of exploiting these temporal dependencies by maintaining hidden states over the duration of a window. Note, however, that latent states are not propagated between windows. Effectively, the neural network acts as a function $f_\theta$ that maps sensor measurements to a polar displacement over a window:

$$(\mathbf{a}, \mathbf{w})_{200\times 6} \xrightarrow{f_\theta} (\Delta l, \Delta\psi)_{1\times 2} \quad (18)$$

where a window length of 200 frames (2 s) is used here². In the physical model, orientation transformations impact all subsequent outputs. We adopt Long Short-Term Memory (LSTM) to handle the exploding and vanishing gradient problems of vanilla RNNs, as it has a much better ability to exploit long-term dependencies (Greff et al. 2016). In addition, as both previous and future frames are crucial in updating the current frame, a bidirectional architecture is adopted to exploit dynamic context.

Equation (16) shows that modeling the final polar vector requires modeling some intermediate latent variables, e.g.

²We experimented with window sizes of 50, 100, 200 and 400 frames, and selected 200 as the optimal parameter regarding the trade-off between accumulated location error and prediction loss.


Figure 4: Performance in experiments involving different users; CDF of location error (m) for IONet, PDR and SINS across Users 1-4. (a) Handheld (b) In Pocket (c) In Handbag

Figure 5: Performance in experiments involving different devices; CDF of location error (m) for IONet, PDR and SINS on iPhone 7, iPhone 6 and iPhone 5. (a) Handheld (b) In Pocket (c) Handbag

Figure 6: Trajectories on Floor A ((a) Handheld (b) In Pocket (c) In Handbag)

Figure 7: Trajectories on Floor B ((a) Handheld (b) In Pocket (c) In Handbag)


Figure 8: Validation loss of various frameworks (vanilla RNN, CNN, 1-layer LSTM, 2-layer LSTM, 2-layer Bi-LSTM) over 100 training epochs

initial velocity and gravity. Therefore, to build up a higher representation of the IMU data, it is reasonable to stack 2-layer LSTMs on top of each other, with the output sequences of the first layer supplying the input sequences of the second layer. The second LSTM outputs one polar vector to represent the transformation in the processed sequence. Each layer has 96 hidden nodes. To increase the output data rate of polar vectors and locations, the IMU measurements are divided into independent windows with a stride of 10 frames (0.1 s).
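The segmentation described here (windows of 200 frames with a stride of 10 frames) can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def segment(imu, size=200, stride=10):
    """Slice an (N, 6) IMU stream into overlapping windows of shape (size, 6)."""
    starts = range(0, len(imu) - size + 1, stride)
    return np.stack([imu[i:i + size] for i in starts])

# A 10-second stream at 100 Hz yields 81 overlapping 2-second windows:
windows = segment(np.zeros((1000, 6)))
```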

The optimal parameter $\theta^*$ inside the proposed deep RNN architecture can be recovered by minimizing a loss function on the training dataset $\mathcal{D} = (\mathbf{X}, \mathbf{Y})$:

$$\theta^* = \operatorname*{arg\,min}_\theta \ \ell(f_\theta(\mathbf{X}), \mathbf{Y}) \quad (19)$$

The loss function is defined as the sum of squared Euclidean distances between the ground truth $(\Delta l, \Delta\psi)$ and the estimated value $(\widehat{\Delta l}, \widehat{\Delta\psi})$:

$$\ell = \sum \|\Delta l - \widehat{\Delta l}\|^2_2 + \kappa\|\Delta\psi - \widehat{\Delta\psi}\|^2_2 \quad (20)$$

where $\kappa$ is a factor to regulate the weights of $\Delta l$ and $\Delta\psi$.
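A NumPy rendering of Equation (20) over a batch of windows (the value of $\kappa$ is not given in the excerpt; the 1.0 default below is a placeholder):

```python
import numpy as np

def ionet_loss(dl_true, dpsi_true, dl_pred, dpsi_pred, kappa=1.0):
    """Sum of squared displacement errors plus kappa-weighted heading errors (Eq. 20)."""
    return float(np.sum((dl_true - dl_pred) ** 2)
                 + kappa * np.sum((dpsi_true - dpsi_pred) ** 2))

loss = ionet_loss(np.array([1.0]), np.array([0.5]),
                  np.array([0.0]), np.array([0.0]), kappa=2.0)
# 1.0^2 + 2.0 * 0.5^2 = 1.5
```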

Experiments

Training Details

Dataset: There are no public datasets for indoor localization using phone-based IMUs. We collected data ourselves with pedestrians walking inside a room equipped with an optical motion capture system (Vicon) (Vicon 2017), providing a highly precise full-pose reference (0.01 m for location, 0.1 degree for orientation) for our experimental device. The training dataset consists of IMU data from the phone in different attachments, e.g. handheld, in pocket, in handbag and on trolley, each for 2 hours, collected with a very common consumer phone, the iPhone 7 Plus. Note that all of our training data was collected by User 1 carrying the iPhone 7. To test our model's ability to generalize, we invited 3 new participants and evaluated on another two phones, i.e. iPhone 6 and iPhone 5.

Training: We implemented our model on the publicly available TensorFlow framework and ran the training process on an NVIDIA TITAN X GPU. During training, we used Adam, a first-order gradient-based optimizer (Kingma and Ba 2015), with a learning rate of 0.0015. Training typically converges after 100 iterations. To prevent our neural networks from overfitting, we gathered data with abundant motion characteristics and adopted Dropout (Srivastava et al. 2014) in each LSTM layer, randomly dropping 25% of the units during training. This significantly reduces overfitting and proves to perform well on new users, devices and environments.

Comparison with Other DNN Frameworks: To evaluate our choice of a 2-layer bidirectional LSTM for polar vector regression, we compare its validation results with various other DNN frameworks, including a vanilla RNN, a vanilla convolutional neural network (CNN), a 1-layer LSTM and a 2-layer LSTM without bidirection. The training data are from all attachments. Figure 8 shows their validation loss curves. Our proposed framework with a 2-layer Bi-LSTM descends more steeply, and stays lower and smoother during training than all the other neural networks, supporting our choice, while the vanilla RNN is stuck with vanishing gradients and the CNN does not seem to capture temporal dependencies well.

Testing: We also found that training separately on each attachment yields better prediction performance than training jointly, hence in the following experiments we use 2-layer Bi-LSTM prediction models trained on separate attachments. In a practical deployment, existing techniques can be adopted to recognize different attachments from pure IMU measurements (Brajdic and Harle 2013), providing the ability to dynamically switch between trained models.

Baselines: Two traditional methods are selected as baselines to compare with our prediction results: pedestrian dead reckoning (PDR) and the strapdown inertial navigation system (SINS) mechanism (Savage 1998). PDR algorithms are seldom open-sourced, especially robust PDR usable across different attachments, so we implemented our own, following (Brajdic and Harle 2013) for step detection and (Xiao et al. 2014b) for heading and step length estimation.

Tests Involving Multiple Users and Devices

A series of experiments was conducted inside a large room with new users and phones to show our neural network's ability to generalize. The Vicon system provides a highly accurate reference against which location errors are measured.

The first group of tests includes four participants, each walking randomly for two minutes with the phone in different attachments, e.g. in hand, pocket and handbag respectively, covering everyday behaviors. Our training dataset does not contain data from three of these participants. The performance of our model is measured as an error cumulative distribution function (CDF) against the Vicon ground truth and compared with conventional PDR and SINS. Figure 4 illustrates that our proposed approach outperforms the competing methods in every attachment. If raw data is directly triply integrated by SINS, its error propagates exponentially. The maximum error of our IONet stayed around 2 meters for 90% of the testing time, a 30%-40% improvement over traditional PDR, as shown in Figure 10a.


[Figure: three trajectory plots, East (m) vs. North (m), each marking the start and end points.]

Figure 9: Trolley tracking trajectories of (a) Ground Truth (Vicon), (b) IONet, (c) Tango

(a) Multiple Users (b) Multiple Phones

Figure 10: Maximum position error within 90% test time

Another group of experiments tests the performance across different devices, shown in Figure 5. We chose another two very common consumer phones, the iPhone 6 and iPhone 5, whose IMU sensors, the InvenSense MP67B and ST L3G4200DH, are quite distinct from that of our training device, the iPhone 7 (IMU: InvenSense ICM-20600). Although the intrinsic properties of an IMU influence the quality of inertial measurements, our neural network shows good robustness.

Large-scale Indoor Localization
Here, we apply our model to a more challenging indoor localization experiment to evaluate its performance in a new environment. Our model, without any training outside the Vicon room, is directly applied to six large-scale experiments conducted on two floors of an office building. The new scenarios contain long straight lines and slopes, which were not present in the training dataset. Lacking a high-precision reference from Vicon, we take the Google Tango tablet (Tango 2014), a well-known visual-inertial device, as pseudo ground truth.

The floor maps are illustrated in Figure 6 (about 1650 m² in size) and Figure 7 (about 2475 m²). Participants walked normally along the corridors with the phone in each of the three attachments. The trajectories predicted by our proposal are closer to the Tango trajectories than those of the two other approaches in Figures 6 and 7. The continuously propagating error of the SINS mechanism caused trajectory drifts that grow exponentially with time. Impacted by wrong step detection and inaccurate step-stride and heading estimation, PDR accuracy is limited. We calculate the absolute position error against the pseudo ground truth from Tango at distances of 50 m, 100 m and at the end point in Figure 11. Our IONet shows competitive performance over traditional PDR and has the advantage of generating a continuous trajectory at 10 Hz, though its heading attitude occasionally deviates from the true values.

[Figure: bar charts of position error (m) vs. distance (m) at 50 m, 100 m and the end point (141 m on Floor A, 154 m on Floor B), for IONet and PDR in the handheld, pocket and handbag attachments.]

Figure 11: Position error in large-scale indoor localization

Trolley Tracking
We consider a more general problem without periodic motion, which is hard for traditional step-based PDR or for SINS on a limited-quality IMU. Tracking wheel-based motion, such as a shopping trolley/cart, robot or baby stroller, is highly challenging and hence under-explored. Current approaches to tracking wheeled objects are mainly based on visual odometry or visual-inertial odometry (VIO) (Li and Mourikis 2013; Bloesch et al. 2015). They do not work when the device is occluded or operating in a low-light environment, such as when placed in a bag. Moreover, their high energy and computation consumption also constrains further application. Here, we apply our model to a trolley tracking problem using only inertial sensors. Due to a lack of comparable techniques, our proposal is compared with the state-of-the-art visual-inertial odometry of Tango.

[Figure: CDF (%) vs. position error (m) for IONet (our proposal) and Tango (visual-inertial odometry), with the point at which the VIO collapsed marked.]

Figure 12: CDF of Trolley Tracking

Our experiment devices, namely an iPhone 7 and the Google Tango, are attached to a trolley pushed by a participant. The detailed experimental setup and results can be found in the supplementary video 3. A high-precision motion reference was provided by Vicon. From the trajectories of Vicon, our IONet and Tango in Figure 9, our proposed approach shows almost the same accuracy as Tango, and even better robustness, because our purely inertial approach suffers less from environmental factors. With the help of visual features, VIO (Tango) can constrain error drift by fusing visual transformations with the inertial system, but it collapses when it captures wrong features or no features, especially in open spaces. This happened in our experiment, as shown in Figure 12. Although the VIO can recover from the collapse, it is still left with a large distance error.

Conclusion and Future Work
We presented a new neural network framework that learns inertial odometry directly from raw IMU data. Our model is derived from Newtonian mechanics and formulated as a sequential learning problem, using deep recurrent neural networks to recover motion characteristics. The performance of IONet is evaluated through extensive experiments, including tests involving multiple users/devices, large-scale indoor localization and trolley tracking. It outperforms both the traditional step-based PDR and SINS mechanisms. We believe our work lays the foundation for a more accurate and reliable indoor navigation system fusing multiple sources.

A number of research challenges lie ahead: 1) The performance of our proposed IONet degrades when the input IMU measurements are corrupted by a high bias value, because the training and testing data are then not in the same domain. 2) Challenging motions and the distinct walking habits of new users can influence the effectiveness of our proposed approach. 3) Our current model is trained and tested on each attachment separately; the challenge becomes how to jointly train one model on all attachments and improve its robustness. 4) Combining the sequence-based physical model with the proposed neural network model would be an interesting direction to investigate in the future.

3 Video is available at: https://youtu.be/pr5tR6Wz-zs

Therefore, our future work includes extending our work to learn transferable features without the limitations of measurement units, users and environments, realized on all platforms including mobile devices and robots.

Acknowledgments
We would like to thank the anonymous reviewers for their valuable comments. We also thank our colleagues, Dr. Sen Wang, Dr. Stefano Rosa, Shuyu Lin, Linhai Xie, Bo Yang, Jiarui Gan, Zhihua Wang, Ronald Clark and Zihang Lai, for their help, and Prof. Hongkai Wen at the University of Warwick for useful discussions and the use of GPU computational resources.

References

[Bloesch et al. 2015] Bloesch, M.; Omari, S.; Hutter, M.; and Siegwart, R. 2015. Robust visual inertial odometry using a direct EKF-based approach. In IEEE International Conference on Intelligent Robots and Systems (IROS), 298–304.

[Brajdic and Harle 2013] Brajdic, A., and Harle, R. 2013. Walk detection and step counting on unconstrained smartphones. UbiComp '13.

[Clark et al. 2017a] Clark, R.; Wang, S.; Markham, A.; Trigoni, N.; and Wen, H. 2017a. VidLoc: 6-DoF Video-Clip Relocalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17).

[Clark et al. 2017b] Clark, R.; Wang, S.; Wen, H.; Markham, A.; and Trigoni, N. 2017b. VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem. AAAI, 3995–4001.

[Dai and Le 2015] Dai, A. M., and Le, Q. V. 2015. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems (NIPS 2015), 3079–3087.

[Donahue et al. 2015] Donahue, J.; Hendricks, L. A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Darrell, T.; and Saenko, K. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2625–2634.

[Graves and Jaitly 2014] Graves, A., and Jaitly, N. 2014. Towards End-to-End Speech Recognition with Recurrent Neural Networks. In Proceedings of the 31st International Conference on Machine Learning (ICML), 1764–1772.

[Greff et al. 2016] Greff, K.; Srivastava, R. K.; Koutnik, J.; Steunebrink, B. R.; and Schmidhuber, J. 2016. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems.

[Harle 2013] Harle, R. 2013. A Survey of Indoor Inertial Positioning Systems for Pedestrians. IEEE Communications Surveys & Tutorials 15(3):1281–1293.

[Hausdorff 2007] Hausdorff, J. M. 2007. Gait dynamics, fractals and falls: Finding meaning in the stride-to-stride fluctuations of human walking. Human Movement Science 26(4):555–589.

[Hilsenbeck et al. 2014] Hilsenbeck, S.; Bobkov, D.; Schroth, G.; Huitl, R.; and Steinbach, E. 2014. Graph-based Data Fusion of Pedometer and WiFi Measurements for Mobile Indoor Positioning. UbiComp '14, 147–158.

[Hippe 2006] Hippe, P. 2006. Windup in Control.

[Hooman et al. 2017] Hooman, J.; Roever, W.-P. D.; Pandya, P.; Xu, Q.; Zhou, P.; and Schepers, H. 2017. A Compositional Object-Based Approach to Learning Physical Dynamics. ICLR 1:155–173.

[Karl et al. 2016] Karl, M.; Soelch, M.; Bayer, J.; and van der Smagt, P. 2016. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. Proceedings of the Intuitive Physics Workshop at NIPS 2016, 2–5.

[Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 1–15.

[Leutenegger et al. 2015] Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; and Furgale, P. 2015. Keyframe-based visual-inertial odometry using nonlinear optimization. The International Journal of Robotics Research 34(3):314–334.

[Li and Mourikis 2013] Li, M., and Mourikis, A. I. 2013. High-precision, consistent EKF-based visual-inertial odometry. The International Journal of Robotics Research 32(6):690–711.

[Li et al. 2012] Li, F.; Zhao, C.; Ding, G.; Gong, J.; Liu, C.; and Zhao, F. 2012. A reliable and accurate indoor localization method using phone inertial sensors. UbiComp '12, 421–430.

[Lymberopoulos et al. 2015] Lymberopoulos, D.; Liu, J.; Yang, X.; Choudhury, R. R.; Handziski, V.; and Sen, S. 2015. A Realistic Evaluation and Comparison of Indoor Location Technologies: Experiences and Lessons Learned. IPSN 2015, 178–189.

[Naser 2008] Naser, El-Sheimy; Haiying, H. X. N. 2008. Analysis and Modeling of Inertial Sensors Using Allan Variance. IEEE Transactions on Instrumentation and Measurement 57(1):684–694.

[Ondruska and Posner 2016] Ondruska, P., and Posner, I. 2016. Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 3361–3367.

[Savage 1998] Savage, P. G. 1998. Strapdown Inertial Navigation Integration Algorithm Design Part 1: Attitude Algorithms. Journal of Guidance, Control, and Dynamics 21(1):19–28.

[Shu et al. 2015] Shu, Y.; Shin, K. G.; He, T.; and Chen, J. 2015. Last-Mile Navigation Using Smartphones. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (MobiCom), 512–524.

[Skog et al. 2010] Skog, I.; Handel, P.; Nilsson, J.-O.; and Rantakokko, J. 2010. Zero-velocity detection — an algorithm evaluation. IEEE Transactions on Bio-Medical Engineering 57(11):2657–2666.

[Srivastava et al. 2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15:1929–1958.

[Stewart and Ermon 2017] Stewart, R., and Ermon, S. 2017. Label-Free Supervision of Neural Networks with Physics and Domain Knowledge. AAAI 2017, 1–7.

[Tango 2014] Tango. 2014. Google Tango Tablet.

[Vicon 2017] Vicon. 2017. Vicon Motion Capture Systems.

[Wang et al. 2016] Wang, S.; Wen, H.; Clark, R.; and Trigoni, N. 2016. Keyframe based Large-Scale Indoor Localisation using Geomagnetic Field and Motion Pattern. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1910–1917.

[Wang et al. 2017] Wang, S.; Clark, R.; Wen, H.; and Trigoni, N. 2017. DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In IEEE International Conference on Robotics and Automation (ICRA).

[Xiao et al. 2014a] Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2014a. Lightweight map matching for indoor localization using conditional random fields. In International Conference on Information Processing in Sensor Networks (IPSN 2014), 131–142.

[Xiao et al. 2014b] Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2014b. Robust pedestrian dead reckoning (R-PDR) for arbitrary mobile device placement. In IPIN.

[Xiao et al. 2015] Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2015. Robust indoor positioning with lifelong learning. IEEE Journal on Selected Areas in Communications 33(11):2287–2301.

[Zhou et al. 2017] Zhou, T.; Brown, M.; Snavely, N.; and Lowe, D. G. 2017. Unsupervised Learning of Depth and Ego-Motion from Video. CVPR 2017.