key: cord-0674804-scfhwdtr
authors: Herath, Sachini; Caruso, David; Liu, Chen; Chen, Yufan; Furukawa, Yasutaka
title: Neural Inertial Localization
date: 2022-03-29
journal: nan
DOI: nan
sha: 29badc501211e0c11e55e4d4ca7e72a5eb8d2494
doc_id: 674804
cord_uid: scfhwdtr

This paper proposes the inertial localization problem, the task of estimating the absolute location from a sequence of inertial sensor measurements. This is an exciting and unexplored area of indoor localization research, for which we present a rich dataset with 53 hours of inertial sensor data and the associated ground-truth locations. We developed a solution, dubbed neural inertial localization (NILoc), which 1) uses a neural inertial navigation technique to turn the inertial sensor history into a sequence of velocity vectors; then 2) employs a Transformer-based neural architecture to find the device location from the sequence of velocities. We only use an IMU sensor, which is energy efficient and privacy preserving compared to WiFi, cameras, and other data sources. Our approach is significantly faster and achieves competitive results even compared with state-of-the-art methods that require a floorplan and run 20 to 30 times slower. We share our code, model and data at https://sachini.github.io/niloc.

Imagine one stands up, walks for 3 meters, turns right, and opens a door in an office. This information might be sufficient to identify the location of the individual. A recent breakthrough in inertial navigation [10, 14, 22] allows us to obtain such motion history using an inertial measurement unit (IMU). What is missing is the technology that maps a motion history to a location. This paper addresses this gap, seeking to open a new paradigm in localization research, named "inertial localization", whose task is to infer the location from a sequence of IMU sensor data. (* Corresponding author: sherath@sfu.ca. This work was partially done when Sachini was an intern at Meta.)

Indoor localization is a crucial technology for location-aware services, such as mobile business applications for consumers, entertainment (e.g., Pokemon Go) for casual users, and industry verticals for professional operators (e.g., maintenance at a factory). State-of-the-art indoor localization systems [5] mostly rely on WiFi, whose infrastructure is ubiquitous thanks to the demands of the Internet of Things (IoT). Nevertheless, the accuracy of WiFi-based localization depends on the infrastructure (i.e., the number of access points) and thus cannot scale easily to non-commercial private spaces. IMU is a powerful complementary modality to WiFi, and has recently proven effective for the navigation task [10, 14, 22]. An IMU 1) works anytime, anywhere (e.g., inside a pocket, bag, or hand); 2) is energy efficient enough to be an always-on sensor; and 3) protects the privacy of bystanders.

This paper introduces a novel inertial localization problem as the task of estimating the location from a history of IMU measurements. The paper provides the first inertial localization benchmark, consisting of 53 hours of motion data and ground-truth locations over three buildings. The paper also proposes an effective solution to the problem, dubbed neural inertial localization (NILoc). NILoc first uses a neural inertial navigation technique [10] to turn IMU sensor data into a sequence of velocity vectors, where the remaining task is to map a velocity sequence to a location. The high uncertainty in this remaining task is the challenge of inertial localization.
For instance, a stationary motion can be anywhere, and a short forward motion can be at any corridor. To overcome the uncertainty, our approach employs a Transformer-based neural architecture [27] (capable of encoding complex long sequential data) with a Temporal Convolutional Network (further expanding the temporal capacity by compressing the input sequence length) and an auto-regressive decoder (handling arbitrarily long sequential data).

The contributions of the paper are three-fold: 1) a novel inertial localization problem, 2) a new inertial localization benchmark, and 3) an effective neural inertial localization algorithm. We will share our code, models and data.

Outdoor navigation predominantly uses satellite GPS. Indoor localization often relies on multiple data sources such as images, WiFi, magnetic fields, or IMU. We review indoor localization techniques based on the input modalities.

Image-based localization estimates a camera's 6DoF pose from a query image. A classical approach is to detect feature pixels, establish 2D-to-3D or 2D-to-2D correspondences, and solve a Perspective-n-Point (PnP) problem [23]. The surge of deep learning allows us to learn these steps with an end-to-end network [30]. Another family of neural architectures directly regresses pose parameters [13]. InLoc [24], a system for image-based indoor localization, reports localization error below 1 meter for 69.9% of queries. While precise, the image modality suffers from a few major drawbacks for mobile applications: a camera needs a direct line of sight, consumes a significant amount of battery, and reveals information about bystanders.

Wireless localization based on WiFi or Bluetooth is the current mainstream modality for indoor localization [3, 21, 29]. Wireless receivers work anytime, anywhere, and WiFi infrastructure is ubiquitous thanks to the ever-growing IoT market demands. The wireless modality is not as precise as the image-based approach and reports a minimum error radius of 10 meters [9]. This paper studies inertial localization as an effective complementary modality.

Activity recognition and magnetic fields are other modalities for indoor localization. Activity recognition based on IMU sensor data provides cues on the locations through a predetermined mapping between activity types and locations [12, 31]. Location-specific magnetic-field distortions can also be used to build a localization system through a site survey [1, 18, 26].

IMU and floorplan fusion allows classical filtering methods (e.g., a particle filter) to perform localization [25] by using inertial navigation to propagate particles and the floorplan to re-weight them. This approach is sensitive to cumulative sensor errors in inertial navigation. Correlation between a short motion history (five seconds) and a floorplan can provide an additional prior to weight particles [16], but this requires the start location and orientation to initialize the system. We employ a novel Transformer-based neural architecture that regresses the location from a long motion history, even under severe bending of the trajectory. Our approach does not require a floorplan, which often misses transient objects (e.g., chairs and desks) and needs occasional updates, thus providing a compelling alternative.

Inertial navigation estimates relative motions from IMU sensor data, that is, linear accelerations from an accelerometer and angular velocities from a gyroscope.
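As a point of reference for why raw integration is insufficient, the sketch below is our own toy illustration (not the paper's method; function and variable names are hypothetical): naive 2D dead reckoning from gravity-compensated accelerations and yaw rate. Any uncorrected gyroscope bias is integrated into heading and then twice more into position, producing the drift that the learned methods discussed next are designed to avoid.

```python
import numpy as np

def naive_dead_reckoning(acc, gyro_z, dt):
    """Toy 2D strapdown integration (illustration only).

    acc    : (N, 2) gravity-compensated planar accelerations in the body frame
    gyro_z : (N,)   yaw rate around the gravity axis
    dt     : sample period in seconds
    """
    heading = np.cumsum(gyro_z) * dt                      # integrate yaw rate
    cos_h, sin_h = np.cos(heading), np.sin(heading)
    # rotate body-frame acceleration into the world frame
    acc_world = np.stack([cos_h * acc[:, 0] - sin_h * acc[:, 1],
                          sin_h * acc[:, 0] + cos_h * acc[:, 1]], axis=1)
    velocity = np.cumsum(acc_world, axis=0) * dt          # first integration
    position = np.cumsum(velocity, axis=0) * dt           # second integration
    return position
```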
Deep learning has made significant progress in recent years, producing accurate 3DoF [10] and 6DoF [14, 22] trajectories by learning repetitive human body motions with Convolutional Neural Networks (CNNs) [8] or LSTMs [11]. The main source of error is the bias in consumer-grade gyroscopes, which accumulates into significant orientation errors, becoming as large as 20 degrees after 5 minutes of motion even with calibration. Our inertial localization algorithm first uses inertial navigation to estimate a sequence of velocity vectors from IMU data.

Inertial localization is the task of estimating the location of a subject in an environment, solely from a history of IMU sensor data. There are a training phase and a testing phase, neither of which uses a floorplan or external location information. During testing, the input is a sequence of acceleration (accelerometer), angular velocity (gyroscope), and optionally magnetic field (compass) measurements, each of which has 3 DoFs. The output is position estimates at a given set of timestamps, namely those at which ground-truth positions are available. During training, we have a set of input IMU sensor data and output positions.

Metrics: The localization accuracy is measured by 1) the ratio (%) of correct position estimates within a distance threshold (1, 2, 4, or 6 meters) [24]; and 2) the ratio (%) of correct velocity directions within an angular threshold (20 or 40 degrees). The position ratio is the main metric, while the direction ratio measures the temporal consistency.

Re-localization task extension: We propose an inertial re-localization task, which differs from inertial localization in that the starting position (in R^2), and optionally the starting motion direction (i.e., the full SE(2) pose), is known a priori. The task represents a scenario where one uses WiFi to obtain a global position once every few minutes, while re-localizing oneself in between with an IMU sensor for energy efficiency.

We present the first inertial localization dataset, containing 53 hours of motion/trajectory data from two university buildings and one office space. Table 1 summarizes the dataset statistics, while Fig. 2 visualizes all the ground-truth trajectories overlaid on a floorplan. Each scene spans a flat floor, and the position is given as a 2D coordinate without vertical displacement. If available, a floorplan image is provided for a scene for qualitative visualization; it depicts architectural structures (e.g., walls, doors, and windows) but does not contain transient objects such as chairs, tables, and couches.

Data collection: We collect IMU sensor data and ground-truth locations with smartphones. In the future, AR devices (e.g., Aria glasses by Meta, Spectacles by Snap) will allow the collection of ego-centric datasets with tightly coupled IMU and camera data. We used two devices in this work: 1) a handheld 3D tracking phone (Google Tango, Asus ZenFone AR) with built-in visual-inertial SLAM capability, producing ground-truth relative motions, where the Z axis is aligned with gravity; and 2) a standard smartphone, recording IMU sensor data under natural phone handling (e.g., in a pocket, in a hand, or while making a call). We utilize Tango Area Description Files [6] to align ground-truth trajectories to a common coordinate frame and then manually align them with the floorplan. University A contains data from the RoNIN dataset [10], manually aligned to a floorplan. Both IMU sensor data and ground-truth positions are recorded at 200 Hz.
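Evaluation on this benchmark uses the two success-rate metrics defined earlier in this section. Below is a minimal sketch of how they can be computed; the helper is our own illustration, assumes predicted and ground-truth positions are already paired by timestamp, and may differ from the official evaluation script in details such as velocity sampling.

```python
import numpy as np

def success_rates(pred_xy, gt_xy, dist_thresh=2.0, angle_thresh_deg=20.0):
    """Position and direction success rates (sketch).

    pred_xy, gt_xy : (N, 2) arrays of positions in meters, same timestamps.
    """
    # 1) ratio of positions within a distance threshold (meters)
    dist_ok = np.linalg.norm(pred_xy - gt_xy, axis=1) < dist_thresh

    # 2) ratio of velocity directions within an angular threshold (degrees)
    pred_v, gt_v = np.diff(pred_xy, axis=0), np.diff(gt_xy, axis=0)
    cos = np.sum(pred_v * gt_v, axis=1) / (
        np.linalg.norm(pred_v, axis=1) * np.linalg.norm(gt_v, axis=1) + 1e-9)
    angle_ok = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) < angle_thresh_deg

    return 100.0 * dist_ok.mean(), 100.0 * angle_ok.mean()
```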
Test sequences: We randomly select one-sixth of the trajectories as the testing data, whose average duration is 13.3 minutes. We also form a set of short fixed-length sub-sequences for testing by randomly cropping three sub-sequences (100 meters each) from each testing sequence. The NILoc dataset is de-identified to mask the identity of subjects and does not contain any image or video data.

Instead of regressing locations directly from IMU measurements, our system NILoc capitalizes on neural inertial navigation technology [10] that turns a sequence of IMU sensor data into a sequence of velocity vectors, so our core task is to turn velocity vectors into location estimates.

Figure 3. Neural inertial localization system diagram. We use a two-branch Transformer architecture to estimate the location likelihood from the velocity input. Paths used only in training are shown as dashed lines.

High uncertainty is the challenge in this task. NILoc employs a neural architecture with two Transformer-based network branches [27], capable of using a long history of complex motion data to reduce uncertainty. The "velocity branch" encodes a sequence of velocity vectors, where a Temporal Convolutional Network compresses the temporal dimension to further expand the temporal receptive field. The "auto-regressive location branch" encodes a sequence of location likelihoods, capable of auto-regressively producing location estimates over a long horizon. The network is trained per scene given training data. This section explains the two branches (Secs. 5.1 and 5.2), the training scheme (Sec. 5.3), and the data augmentation process (Sec. 5.4), which proves effective in the absence of sufficient training data.

The velocity branch estimates a location sequence from a history of velocity data. It consists of three network modules: a TCN-based velocity compressor, a Transformer velocity encoder, and a translation-aware location decoder.

TCN-based velocity compressor: The Transformer is powerful but memory intensive. We use a temporal convolutional network (TCN) [2] to compress the velocity sequence length by a factor of 10, allowing us to process a longer motion history. In particular, we use a 2-layer TCN with a receptive field of 10 to compress a sequence of 2D velocity vectors {v_t} of length T into a sequence of d-dimensional feature vectors {v'_t} of length T/10.

Transformer encoder: The Transformer architecture [27] takes the compressed velocity vectors {v'_t} as tokens and initializes each token's feature vector f_t by concatenating v'_t with the d/2-dimensional trigonometric positional encoding of the frame index. The output embedding e_t of each token is a d'-dimensional vector encoding the location likelihood. The encoder has two blocks of self-attention networks. Each block has 2 standard Transformer encoder layers with 8-way multi-head attention. Feature vectors after the first block are also passed to the other branch (i.e., the auto-regressive location branch).

Translation-aware location decoder: The last module operates on each individual embedding e_t. First, e_t is rearranged into an image feature volume (a 3D tensor) and up-sampled via a 3-layer fully convolutional decoder with transpose convolutions. The last layer is a "translation-aware" 1×1 convolution, whose parameters are not shared across pixels. To account for uncertainty, the output location is represented as a 2D likelihood map L_t of size W×H, where L_t(x, y) is the likelihood of being at pixel (x, y).
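To make the idea concrete, here is a minimal sketch of a 1×1 convolution whose weights are not shared across pixels; the module, shapes, and initialization are our own illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class TranslationAware1x1(nn.Module):
    """1x1 convolution with a separate weight matrix and bias per pixel."""

    def __init__(self, c_in, c_out, height, width):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(height, width, c_out, c_in) * 0.01)
        self.bias = nn.Parameter(torch.zeros(height, width, c_out))

    def forward(self, x):                        # x: (B, c_in, H, W)
        # per-pixel channel mixing; no spatial weight sharing
        out = torch.einsum('bchw,hwoc->bohw', x, self.weight)
        return out + self.bias.permute(2, 0, 1)  # (B, c_out, H, W)

# Example: map a 16-channel feature map to a single-channel location logit map.
layer = TranslationAware1x1(c_in=16, c_out=1, height=32, width=48)
logits = layer(torch.randn(2, 16, 32, 48))       # (2, 1, 32, 48)
likelihood = logits.flatten(1).softmax(dim=-1).view_as(logits)
```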
This translation-aware layer allows the network to easily learn translation-dependent information such as "people never come to this location" or "one always passes through this doorway".

The location branch combines the velocity features from the velocity branch with prior location likelihoods, which come from its own past inferences or from external position information such as WiFi. The location branch has the same architecture as the velocity branch, with two differences. First, instead of a TCN-based velocity compressor, we use a ConvNet to convert each W×H likelihood map into a d'-dimensional vector. We use the same trigonometric positional encoding (with dimension d' instead of d/2 to match), which is added to the vector. Second, we inject velocity features from the velocity branch via cross-attention after every self-attention layer (i.e., before every add-norm layer). The rest of the architecture is the same. Note that both branches predict locations and have different trade-offs (see Sec. 6.4 for an ablation study and discussion).

At inference time, we first evaluate the velocity branch in a sliding-window fashion to compute velocity feature vectors. The location branch takes a history of location likelihoods up to 20 frames: {L_t, L_{t-1}, ..., L_{t-19}}. L_0 encodes external initial location information (e.g., from WiFi), or a uniform distribution if none is available. At the output, a token initialized with the likelihood at frame t' produces a likelihood estimate at frame t'+1. Therefore, we infer the likelihood of each frame up to 20 times and compute a weighted average as the final likelihood, with weights decreasing from 1.0 down to 0.05 from the first inference result to the last.

We use a cross-entropy loss at both branches. The ground-truth likelihood is a zero-intensity image, except for one pixel at the ground-truth location whose value is 1.0. We employ parallel scheduled sampling [17] to train the auto-regressive location branch without unrolling recurrent inferences. The process has two steps. First, we pass the GT likelihoods to all the input tokens and make predictions. Second, we keep the GT likelihoods in the input tokens with probability r_teacher (known as the teacher-forcing ratio), while replacing the remaining tokens with the predicted likelihoods. Back-propagation is conducted only in the second step. r_teacher is set to 1.0 for the first 50 epochs and reduced by 0.01 after every 5 epochs.

The Transformer architecture requires a large amount of training data, and the COVID pandemic made data collection further challenging. We crop data over different time windows to augment training samples. However, in the absence of sufficient training data, we use the following three steps to generate more training samples synthetically: 1) compute a likelihood map of training trajectories (i.e., where they pass through); 2) randomly pick a pair of locations from high-likelihood areas; and 3) solve an optimization problem to produce a trajectory that is smooth and passes through areas of high likelihood. Given a synthesized trajectory, we sample velocity vectors based on the distance traveled, as in our preprocessing step; these are directly passed to the TCN-based velocity compressor during training. All the steps are standard heuristics, and we refer to the supplementary for details.

To our knowledge, no prior work addresses indoor localization from IMU data alone. Therefore, we compare with the following three techniques that fuse IMU and floorplan data.
Note that our method is the only one that performs inertial localization from IMU data alone, without floorplan images. We briefly explain the three techniques.

• Particle Filter (PF) maintains a set of particles, each of which stores the location, heading direction, and bias/scale error-correction terms. Starting from a Gaussian distribution around the given initial location, or a uniform distribution otherwise, the system updates the states of the particles based on the inertial navigation result and the floorplan information (i.e., down-weighting particles that fall outside the walkable region). We take the particle closest to the weighted median x/y coordinate as the location prediction.

• Learned Prior (LP) [16] is also a particle-filter-based approach, but uses deep networks to help update particle weights. A location likelihood, computed as the dot product between floorplan features extracted by a U-Net [20] and motion features extracted by an LSTM [11], is used to weight the particles. We use our own implementation, as the code is not available. Note that this method requires the initial location and orientation and is evaluated only for the SE(2) re-localization task.

• Conditional Random Field (CRF) is based on a state-of-the-art map-matching system [28], which computes a reachability graph from a floorplan, uses inertial navigation results to transition between graph nodes, and uses the Viterbi algorithm to backtrack and determine the location. We modified the system in a few ways to better adapt it to our task: 1) use the RoNIN result as the velocity input; 2) increase the search neighborhood by a factor of 1.5 to handle scale inaccuracies in inertial navigation; and 3) do periodic back-propagation within a fixed window to be comparable with the other near-real-time baselines.

Table 3. Ablation study. The first row is an inertial navigation algorithm. The next four rows are results after dropping one technical component from our full system. For the 4th and 5th rows, we dropped the translation-aware location decoder and replaced it with either a fully connected layer (FC) or a fully convolutional decoder (CNN). The last two rows compare predictions by the velocity branch and the location branch, where the latter is the default prediction reported as our result elsewhere. The success rates (%) at two distance thresholds (m) on building A are the metrics.

We have implemented the proposed system in PyTorch Lightning [4]. For training, we use the AdamW optimizer [15]. The learning rate is initialized to 0.0001 with a 30-epoch warm-up [7], and reduced by a factor of 0.75 after every 10 epochs in which the validation loss does not decrease. We use a random one-sixth of the training trajectories as a hold-out validation set. In the main experiments we combine real and synthetic data for training. Specifically, we train on the combined set until convergence (700 epochs) and fine-tune only on the real dataset (400 epochs). The total training time is approximately 3 days on a GeForce RTX 2080 Ti and 1 day on NVIDIA Tesla V100 GPUs. For the RoNIN software, we downloaded the trained model from the official website [10]. Unless otherwise noted, our results are the predictions by the auto-regressive location branch.

For the baseline methods, we use PyTorch [4] and CuPy [19] to implement them on both CPU and GPU. For fairness, we use the same floorplan resolution and the same distance-based sampling to extract velocities. A grid search is used to find hyperparameters for all baselines.
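For reference, a minimal PyTorch sketch approximating the optimization recipe above (AdamW, linear warm-up, then decay when the validation loss plateaus); the model and validation loss are placeholders, and the patience-based scheduler is our own approximation of the 10-epoch rule, not the released training code.

```python
import torch

model = torch.nn.Linear(2, 2)            # stand-in for the NILoc network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_epochs = 30
warmup = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda e: min(1.0, (e + 1) / warmup_epochs))       # linear warm-up
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode='min', factor=0.75, patience=10)               # decay on plateau

for epoch in range(700):
    # ... one epoch of training would go here ...
    val_loss = torch.rand(1).item()       # placeholder validation loss
    if epoch < warmup_epochs:
        warmup.step()
    else:
        plateau.step(val_loss)
```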
Table 2a shows our main result: quantitative evaluations on the localization task for each of the three buildings. NILoc achieves the best results in most entries. The only exception is against CRF on building A for the fixed-length short sequences. Note that CRF uses floorplan information while our input is only IMU. CRF is also computationally intensive, 30 times slower than our method, as it uses dynamic programming to search over essentially all possible alignments with the floorplan exhaustively.

Table 2b shows the results on the re-localization task, averaged over the three buildings. All methods consistently improve with more initialization information. While NILoc achieves the best results for lower distance thresholds (i.e., it is often extremely accurate), CRF performs better overall, at the cost of intensive computation and the requirement of a floorplan image.

Figure 5 shows the same trend, plotting the distance-based success rate over a range of thresholds (averaged over the three buildings). The particle filter exhibits poor performance; its major limitation is its inability to handle cumulative drift in the inertial navigation trajectories. Learned prior combines a particle filter with neural networks that learn to associate motions with locations. However, it only encodes five seconds of motion data with an LSTM and a ConvNet, which is not long enough to break the ambiguity. NILoc takes roughly a minute of motion data with a powerful Transformer-based architecture, overcoming the uncertainty. Another observation is that all the baseline methods explicitly integrate velocities to update the location information. NILoc does not bake in this integration formula and relies entirely on learning to relate velocities to locations, which could make our approach more robust against cumulative drift from inertial navigation.

Figure 4 provides qualitative visualizations of one trajectory from building A. The particle filter and CRF estimate locations at roughly 200 Hz, while NILoc runs at roughly 20 Hz. We interpolate NILoc locations when plotting the trajectories, so streaks of points in the figure are interpolation artifacts at discontinuous predictions. As shown by the inertial navigation trajectory at the bottom right, this is not an easy task: the trajectory suffers from significant cumulative drift. Nonetheless, our system is capable of inferring correct locations for most of the frames.

Table 3 is an ablation study, assessing the contributions of various technical components in our system. The first row compares against a state-of-the-art inertial navigation method, which ours and all baselines outperform (also see Fig. 4). The next four rows show the distance-based success rate while dropping four components one by one from our main system. The table shows that it is important to train the network with losses on both branches. The second row is particularly interesting: when both the location branch and the (first half of the) velocity branch are trained with a loss only at the location branch, performance drops significantly. The third row indicates the challenge of high uncertainty in the inertial localization task: the success rate even drops to single digits when the input motion history becomes 10 times shorter without the TCN-based compressor. The last two rows compare the predictions by the velocity branch and the location branch of our full system. The velocity branch does not take previous location likelihoods and cannot solve re-localization tasks.
However, for the localization tasks, it outperforms the location branch by a clear margin, while being twice as computationally efficient.

There are two major failure modes in our approach. First, in an open space (e.g., an atrium), human motions tend not to follow patterns, and any location could be an answer. Second, in the presence of symmetries or repetitions, multiple locations would be equally likely. Our method is designed not to use any future frame information after the first input window so that it can be deployed as a real-time system, where predictions may jump abruptly under high uncertainty. Our future work is to exploit body motion signals that are captured by the IMU but are currently discarded by inertial navigation and distance-based velocity sampling, in order to overcome the uncertainty. For instance, IMU signals differ when one opens a door, washes hands, or orders a coffee, which provides effective cues for localizing the position.

Please refer to the supplementary for more qualitative visualizations (i.e., location trajectories as static images and videos for more samples in more buildings) and quantitative ablation studies (e.g., without synthetic data, without scheduled sampling, or comparison against TCN as a backbone instead of Transformer). We share our code, models and data to promote further research in the space of inertial localization.

Societal impact: Inertial localization could be a critical component of an indoor GPS. A mobile app might become capable of recording location history 24/7 anywhere in indoor spaces, which tend to be more private than outdoor spaces (e.g., inside a house or a restroom). A smartphone developer should understand the impact of exposing IMU sensor data and define appropriate access controls for apps. On the positive side, inertial localization could happen on-device, which would allow a higher degree of privacy control compared to other data modalities.

Sachini Herath^1, David Caruso^2, Chen Liu^2, Yufan Chen^2, Yasutaka Furukawa^1 (^1 Simon Fraser University, BC, Canada; ^2 Reality Labs, Meta, Redmond, USA)

The supplementary document provides: • Algorithmic details (Sec. 1); • More qualitative visualizations (Fig. 1 and 2); • Failure cases (Fig. 3); • Architecture specification (Tab. 1); • Quantitative evaluations on the re-localization task (Tab. 2); • More ablation studies (Tab. 3). Please also refer to our supplementary video, which shows predicted likelihoods and motion trajectories for each method in animation.

We discuss the details of our synthetic data generation process and data augmentation during training, outlined in Section 5.4. When a floorplan image is not available, we compute the walkable region, i.e., a floorplan (F_map), by 1) counting the number of times training trajectories pass through each pixel; 2) clamping the count at a maximum of 2; 3) applying Gaussian smoothing (σ = 1.0 pixels); and 4) binary-thresholding the map with a threshold of 0.5. We generate a synthetic motion trajectory from the floorplan (F_map) by picking start and goal positions randomly and finding a shortest path between them, where the local neighborhood system is defined as an 11×11 square region. Next, in order to make the trajectories look realistic, we apply B-spline smoothing and then solve an optimization problem so that the trajectories pass near the center of corridors and passages.
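A minimal sketch of the walkable-region computation described above; the function name, array layout, and the use of scipy are our own choices, not the released code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def walkable_map(trajectories, height, width):
    """Build a walkable-region mask F_map from training trajectories (sketch).

    trajectories: list of (N, 2) arrays of pixel coordinates (x, y).
    """
    counts = np.zeros((height, width), dtype=np.float32)
    for traj in trajectories:
        xs = np.clip(traj[:, 0].astype(int), 0, width - 1)
        ys = np.clip(traj[:, 1].astype(int), 0, height - 1)
        np.add.at(counts, (ys, xs), 1.0)        # count visits per pixel
    counts = np.clip(counts, 0.0, 2.0)          # clamp the count at 2
    smoothed = gaussian_filter(counts, sigma=1.0)
    return smoothed > 0.5                        # binary walkable mask
```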
Note that when synthetic trajectories are used in training, to minimize the gap between synthetic data and real data, we also apply the B-spline smoothing to the real trajectories before passing their velocity vectors to the velocity branch.

B-spline smoothing: We approximate each synthetic trajectory by fitting a B-spline curve [1]. Given a trajectory p_t with timestamps t ∈ T, we find a smooth spline approximation of degree 3 with smoothing condition s = 5.0 using "scipy.interpolate.splrep". The B-spline knot vector and control points of the approximated curve are k and c.

Optimizing with the floorplan: Given the B-spline knot vector k and control points c of the approximated curve for synthetic data, and a walkable region, we optimize the locations of the control points, c_m, using non-linear least squares to minimize the following energy function:

arg min_{c_m} Σ_{t ∈ T} f_map( B(t; k, c_m) ),

where B(t; k, c_m) denotes the point on the spline (with knot vector k and control points c_m) at timestamp t, and f_map is a function of the distance from a non-walkable pixel, computed by a flood-fill algorithm and smoothed by a Gaussian (σ = 2.0). We sample the smoothed spline at a constant distance of 1 pixel, with Gaussian noise (σ = 3.0), to obtain velocity vectors and ground-truth positions.

We further perform the following three data augmentation tricks. First, to ensure that motion directions are not restricted to a discrete set, we randomly rotate the walkable-region image with the "scipy.ndimage.rotate" function before generating synthetic trajectories by the shortest-path algorithm. Second, we add random Gaussian perturbations to the velocity magnitude and angular rate of the input 2D velocity vector sequence (σ_scale = 0.2 pixels and σ_bias = 0.05 radians per frame) to mimic scaling errors in inertial navigation and IMU gyroscope bias errors. Third, we perform a random in-plane rotation on the input velocity sequence to make the learned features invariant to the unknown starting orientation.

Figure 1. Qualitative visualizations: For one trajectory from Office C, we show results by the four methods (columns) for one localization and two re-localization tasks (rows). Particle filter, learned prior and CRF require a floorplan in addition to IMU input. The color gradient (blue → red → green) encodes time. We mark the physical dimension of each sequence and report success rate (%) at distance thresholds of 1, 2, 4, and 6 meters.

Figure 2. Qualitative visualizations: For one trajectory from building B, we show results by the four methods (columns) for one localization and two re-localization tasks (rows). Particle filter, learned prior and CRF require a floorplan in addition to IMU input. The color gradient (blue → red → green) encodes time. We mark the physical dimension of each sequence and report success rate (%) at distance thresholds of 1, 2, 4, and 6 meters.

Table 2. Evaluation per building for the re-localization task (Table 2b of the main paper contains the average metrics across the three buildings). We compare NILoc (ours) with three methods that require a floorplan as input: Particle Filter (PF), Learned Prior (LP), and Conditional Random Field (CRF). We report the success rate (SR) at a given error distance threshold and angle (A) threshold, per building. Run time is the average CPU or GPU time per 1 minute of motion sequence. The best and second-best results per column are shown in orange and cyan, respectively.

Table 3. Ablation study: The first two rows are results after replacing the Transformer encoder module with different neural architectures, namely LSTM and TCN.
The third and fourth rows show the effectiveness of our training and data augmentation processes, respectively. The success rates (%) at two distance thresholds (m) on building A are the metrics.

Table 1. Architecture specification: The tensors are of batch size n, and the location branch is shown with a maximum sequence length of 20, corresponding to a velocity input sequence length of 200 frames. For buildings B and C, the velocity input shape remains the same, while the location input/output shapes differ.

Acknowledgements: The research is supported by NSERC Discovery Grants, NSERC Discovery Grants Accelerator Supplements, and a DND/NSERC Discovery Grant Supplement. We thank Weilian Song, Saghar Irandoust, and Fuyang Zhang for their contributions to the dataset.

References
[1] Gaussian processes for magnetic map-based localization in large-scale indoor environments.
[2] An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.
[3] Smartphone inertial sensor-based indoor localization and tracking with iBeacon corrections.
[4] William Falcon and The PyTorch Lightning team. PyTorch Lightning.
[5] Fused Location Provider API.
[6] Google Tango Area Description Files.
[7] Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour.
[8] Deep residual learning for image recognition.
[9] Fusion-DHL: WiFi, IMU, and floorplan fusion for dense history of locations in indoor environments.
[10] RoNIN: Robust neural inertial navigation in the wild: Benchmark, evaluations, & new methods.
[11] Long short-term memory.
[12] Indoor positioning in large shopping mall with context-based map matching.
[13] PoseNet: A convolutional network for real-time 6-DoF camera relocalization.
[14] TLIO: Tight learned inertial odometry.
[15] Decoupled weight decay regularization.
[16] Learnable spatio-temporal map embeddings for deep inertial localization.
[17] Scheduled sampling for transformers.
[18] MAIL: Multi-scale attention-guided indoor localization using geomagnetic sequences.
[19] CuPy: A NumPy-compatible library for NVIDIA GPU calculations.
[20] U-Net: Convolutional networks for biomedical image segmentation.
[21] Unsupervised crowd-assisted learning enabling location-aware facilities.
[22] IDOL: Inertial deep orientation-estimation and localization.
[23] Computer vision: Algorithms and applications.
[24] InLoc: Indoor visual localization with dense matching and view synthesis.
[25] Probabilistic robotics.
[26] UJIIndoorLoc-Mag: A new database for magnetic field-based localization problems.
[27] Attention is all you need.
[28] Lightweight map matching for indoor localisation using conditional random fields.
[29] WiFi-based indoor positioning.
[30] LIFT: Learned invariant feature transform.
[31] ALIMC: Activity landmark-based indoor mapping via crowdsourcing.

Supplementary reference:
[1] Algorithms for smoothing data with periodic and parametric splines.