Title: 4Seasons: Benchmarking Visual SLAM and Long-Term Localization for Autonomous Driving in Challenging Conditions

URL Source: https://arxiv.org/html/2301.01147

Published Time: Mon, 23 Jun 2025 00:53:16 GMT

2022

Patrick Wenzel, Nan Yang, Rui Wang, Niclas Zeller, Daniel Cremers

Work done at Technical University of Munich.

Affiliations:
1. Department of Computer Science, Technical University of Munich, Germany
2. Reality Labs at Meta, Redmond, United States
3. Microsoft Mixed Reality & AI Lab, Zurich, Switzerland
4. Karlsruhe University of Applied Sciences, Karlsruhe, Germany

Abstract

In this paper, we present a novel visual SLAM and long-term localization benchmark for autonomous driving in challenging conditions based on the large-scale 4Seasons dataset. The proposed benchmark provides drastic appearance variations caused by seasonal changes and diverse weather and illumination conditions. While significant progress has been made in advancing visual SLAM on small-scale datasets with similar conditions, there is still a lack of unified benchmarks representative of real-world scenarios for autonomous driving. We introduce a new unified benchmark for jointly evaluating visual odometry, global place recognition, and map-based visual localization performance, which is crucial to successfully enable autonomous driving in any condition. The data has been collected for more than one year, resulting in more than 300 km of recordings in nine different environments ranging from a multi-level parking garage to urban (including tunnels) to countryside and highway. We provide globally consistent reference poses with up to centimeter-level accuracy obtained from the fusion of direct stereo-inertial odometry with RTK GNSS. We evaluate the performance of several state-of-the-art visual odometry and visual localization baseline approaches on the benchmark and analyze their properties. The experimental results provide new insights into current approaches and show promising potential for future research. Our benchmark and evaluation protocols will be available at https://go.vision.in.tum.de/4seasons.

keywords:

Autonomous Driving, Benchmark, Long-Term Visual Localization, SLAM, Visual Odometry, Camera Pose Estimation

1 Introduction

During the last decade, research on visual odometry (VO) and simultaneous localization and mapping (SLAM) has made tremendous strides (Newcombe et al, 2011; Engel et al, 2014; Mur-Artal et al, 2015; Engel et al, 2017), particularly in the context of autonomous driving (Engel et al, 2015; Wang et al, 2017a; Yang et al, 2018; Mur-Artal and Tardós, 2017). One reason for this progress has been the publication of large-scale datasets tailored for benchmarking these methods (Cordts et al, 2016; Geiger et al, 2013; Caesar et al, 2020). Nonetheless, existing algorithms have significant limitations. Most approaches are tailored to work well on small-scale datasets that exhibit only limited challenging conditions.

Therefore, the next logical step towards progressing research on visual SLAM is to make it robust under dynamically changing and challenging conditions. This includes VO, e.g. at night or in rain, as well as long-term place recognition and localization against a pre-built map. In this regard, deep learning has shown promising potential to complement the performance of visual SLAM (Dusmanu et al, 2019; Jung et al, 2019; von Stumberg et al, 2020; Jaramillo, 2017). Therefore, it has become all the more important to have datasets that reflect the challenges of real-world environments while also being capable of discerning the performance of state-of-the-art approaches.


Figure 1: 4Seasons benchmark dataset overview. Top: overlaid maps recorded at different times and environmental conditions. The 3D points from the reference map (black) align well with the 3D points from the query map (blue), indicating that the reference poses are accurate. Bottom: sample images demonstrating the diversity of our benchmark. The first row shows a collection from the same scene across different weather and lighting conditions: snowy, cloudy, sunny, and night. The second row depicts the variety of scenarios within the benchmark: inner city, suburban, countryside, and a parking garage.

To accommodate this demand, we present a cross-season and multi-weather benchmark, particularly focusing on visual SLAM and long-term localization for autonomous driving. This benchmark is based on the versatile large-scale 4Seasons dataset (Wenzel et al, 2020). To the best of our knowledge, we provide the first large-scale cross-season benchmark dataset comprising stereo images, corresponding high frame-rate inertial measurement unit (IMU) data, and accurate real-time kinematic (RTK) global navigation satellite system (GNSS) measurements to evaluate sequential localization methods. By traversing the same route under different conditions and over a long-term time horizon, we capture variety in illumination and weather, as well as in the appearance of the scenes. For each scenario, we provide multiple traversals exhibiting different environmental conditions, as described in Table 5. The recordings show large variations in the scene geometry, including dynamic objects, roadworks, construction sites, and seasonal changes. To acquire accurate reference poses of large-scale scenes, we use a custom stereo-inertial sensor together with an RTK GNSS system to obtain up to centimeter-accurate poses. Figure 1 visualizes two overlaid 3D reconstructions of the same scene recorded at different times. Moreover, the figure depicts sample images of the dataset used to evaluate six degrees of freedom (6DoF) localization against a prior map using query images taken from a variety of challenging conditions. We provide reference poses for a subset of the recordings and withhold the rest for an online evaluation benchmark suite. We design a benchmark to measure the impact of long-term environmental changes on the performance of visual SLAM and localization for autonomous driving.

The main contributions of this paper are the extensive benchmark suite for evaluating the long-term visual localization problem for autonomous driving, the evaluation of state-of-the-art baseline SLAM and visual localization algorithms, and the interpretation of the results.

This work extends our paper published at GCPR 2020 (Wenzel et al, 2020) through the following additional contributions:

  • We propose a large-scale cross-season and multi-weather benchmark suite for long-term visual SLAM in automotive applications. It allows the joint evaluation of visual odometry, global place recognition, and map-based visual localization approaches.
  • We release numerous additional sequences covering nine different types of environments, ranging from a multi-level parking garage to urban (including tunnels) to countryside and highway.
  • We provide an extensive evaluation of state-of-the-art baseline approaches for visual SLAM and visual localization on the presented benchmark.

2 Related Work

A variety of benchmarks and datasets focus on VO and SLAM for autonomous driving. Here, we divide these datasets into those focusing only on VO and those covering different weather conditions and thus targeting long-term SLAM.

2.1 Visual Odometry Datasets & Benchmarks

The most popular benchmark for autonomous driving is probably KITTI (Geiger et al, 2013). This multi-sensor dataset covers a wide range of tasks, including not only VO but also 3D object detection and tracking, scene flow estimation, and semantic scene understanding. The dataset contains diverse scenarios ranging from urban to countryside to highway. Nevertheless, all scenarios are recorded only once and under similar weather conditions. Ground truth is obtained from a high-end inertial navigation system (INS).

Another large-scale dataset containing light detection and ranging (LiDAR), IMU, and image data is the Málaga Urban dataset (Blanco-Claraco et al, 2014). However, in contrast to KITTI, no accurate 6DoF ground truth is provided, which precludes an appropriate quantitative evaluation. Moreover, only a few places are visited multiple times.

Other popular datasets for the evaluation of VO and visual-inertial odometry (VIO) algorithms that are not related to autonomous driving include (Sturm et al, 2012) (handheld RGB-D), (Burri et al, 2016) (UAV stereo-inertial), (Engel et al, 2016) (handheld mono), and (Schubert et al, 2018) (handheld stereo-inertial).

2.2 Long-Term SLAM Datasets & Benchmarks

More related to our work are datasets containing multiple traversals of the same environment over a long period. Concerning SLAM for autonomous driving, the Oxford RobotCar Dataset (Maddern et al, 2017) represents pioneering work. This dataset consists of large-scale sequences recorded multiple times in the same environment over one year. Hence, it covers large variations in the appearance and structure of the scene. However, the diversity of the scenarios is limited to an urban environment. Also, the ground truth provided for the dataset is not accurate to centimeter level (Maddern et al, 2017; Spencer et al, 2020). Other existing datasets lack sequential structure (Kenk and Hassaballah, 2020), provide only a single adverse condition (Pitropov et al, 2021), or focus on AR scenarios (Sarlin et al, 2022).

The work by (Sattler et al, 2018) proposes three complementary benchmark datasets based on existing datasets, namely RobotCar Seasons (based on (Maddern et al, 2017)), Aachen Day-Night (based on (Sattler et al, 2012)), and CMU Seasons (based on (Badino et al, 2011)), which have been used for benchmarking visual localization approaches. The ground truth of the RobotCar Seasons (Sattler et al, 2018) dataset is obtained from structure from motion (SfM) and LiDAR point cloud alignment. However, due to inaccurate GNSS measurements (Maddern et al, 2017), globally consistent ground truth with centimeter-level accuracy cannot be guaranteed. Furthermore, this dataset provides only one reference traversal, recorded in overcast conditions. In contrast, we provide globally consistent reference models for all training traversals covering a wide variety of conditions. Hence, every traversal can be used as a reference model, which allows further research on, e.g., analyzing suitable reference-query pairs for long-term localization and mapping.

Global place recognition datasets such as Pittsburgh (Torii et al, 2013), Tokyo 24/7 (Torii et al, 2015), and Mapillary Street-Level Sequences (Warburg et al, 2020) provide only coarse-scale location information. Other related localization datasets include 12-Scenes (Valentin et al, 2016), InLoc (Taira et al, 2018), Cambridge Landmarks (Kendall et al, 2015), and CrowdDriven (Jafarzadeh et al, 2021).

2.3 Other Datasets

Examples of further multipurpose autonomous driving datasets that can also be used for VO are (Cordts et al, 2016; Wang et al, 2017b; Huang et al, 2018; Caesar et al, 2020).

As stated in Section 1, our proposed benchmark dataset differs from previous related work in being both large-scale (similar to (Geiger et al, 2013)) and rich in appearance and condition variations (similar to (Maddern et al, 2017)). Furthermore, accurate reference poses based on the fusion of direct stereo VIO and RTK GNSS are provided. To the best of our knowledge, we are the first to introduce a public, modular benchmark for evaluating visual SLAM, global place recognition, and map-based visual localization approaches under challenging conditions for autonomous driving.

3 System Overview


(a) Test vehicle.


(b) Sensor system.

Figure 2: Recording setup. Test vehicle and sensor system used for dataset recording. The sensor system consists of a custom stereo-inertial sensor with a stereo baseline of 30 cm and a high-end RTK GNSS receiver from Septentrio.

This section presents the sensor setup used for data recording (Section 3.1). Furthermore, we describe the calibration of the entire sensor suite (Section 3.2) as well as our approach to obtaining up to centimeter-accurate global 6DoF reference poses (Section 3.3).

3.1 Sensor Setup

The hardware setup consists of a custom stereo-inertial sensor for 6DoF pose estimation, as well as a high-end RTK GNSS receiver for global positioning and global pose refinement. Figure 2 shows our test vehicle equipped with the sensor system used for data acquisition.

3.1.1 Stereo-Inertial Sensor

The core of the sensor system is our custom stereo-inertial sensor. It consists of a pair of monochrome industrial-grade global-shutter cameras (Basler acA2040-35gm) and lenses with a fixed focal length of f = 3.5 mm (Stemmer Imaging CVO GMTHR23514MCN). The cameras are mounted on a highly rigid aluminum rail with a stereo baseline of 30 cm. On the same rail, a precision MEMS IMU (Analog Devices ADIS16465) is mounted. The cameras and the IMU are triggered by an external clock generated by a field-programmable gate array (FPGA). Here, the trigger accounts for exposure compensation, meaning that the time between the centers of the exposure intervals of two consecutive images is always kept constant (1/[frame rate]), independent of the exposure time itself.

Furthermore, based on the FPGA, the IMU is properly synchronized with the cameras. In the dataset, we record stereo sequences at a frame rate of 30 fps. We perform pixel binning with a factor of two and crop the images to a resolution of 800×400. This results in a field of view of approximately 77° horizontally and 43° vertically. The IMU is recorded at a frequency of 2000 Hz. During recording, we guarantee equal exposure times for the left and right images of each stereo pair, as well as smooth exposure transitions under highly dynamic lighting conditions, as this is favorable for visual SLAM. We provide these exposure times for each frame.
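The exposure-compensated triggering described above can be sketched in a few lines: the k-th trigger fires early by half the commanded exposure time, so that the exposure *centers* stay exactly 1/[frame rate] apart. This is a minimal sketch under our own naming, not the authors' FPGA logic:

```python
def trigger_times(t0, frame_rate, exposures):
    """Exposure-compensated trigger schedule (hypothetical helper).

    Fires the k-th trigger early by half its exposure time so that the
    center of every exposure interval lands exactly 1/frame_rate apart,
    independent of the per-frame exposure time.
    """
    dt = 1.0 / frame_rate
    return [t0 + k * dt - e / 2.0 for k, e in enumerate(exposures)]


# The exposure centers stay equidistant even though exposures vary:
exposures = [0.001, 0.004, 0.010]
times = trigger_times(t0=0.0, frame_rate=30.0, exposures=exposures)
centers = [t + e / 2.0 for t, e in zip(times, exposures)]
```

With the three varying exposures above, consecutive centers differ by exactly 1/30 s, which is the invariant the FPGA trigger maintains.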

3.1.2 GNSS Receiver

For global positioning and to compensate for drift in the VIO system, we utilize an RTK GNSS receiver (mosaic-X5) from Septentrio in combination with an Antcom Active G8 GNSS antenna. The GNSS receiver provides a horizontal position accuracy of up to 6 mm by utilizing RTK correction signals. While the high-end GNSS receiver is used for accurate positioning, we use a second receiver connected to the time-synchronization FPGA to obtain GNSS timestamps for the sensors.

3.2 Calibration

3.2.1 Aperture and Focus Adjustment

Both lenses used in the stereo system have adjustable aperture and focus. Therefore, before performing the geometric calibration of all sensors, we manually adjust both cameras to a matching average brightness and minimal focus blur (Hu and de Haan, 2006), using a structured planar target at a distance of 10 m.

3.2.2 Stereo Camera and IMU

For the intrinsic and extrinsic calibration of the stereo cameras, as well as the extrinsic calibration and time synchronization of the IMU, we use Kalibr (https://github.com/ethz-asl/kalibr) (Rehder et al, 2016). The stereo cameras are modeled using the Kannala-Brandt model (Kannala and Brandt, 2006), a generic camera model with a total of eight parameters. We validated the calibration accuracy of each recording by performing a feature-based epipolar-line consistency check.
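For illustration, the eight-parameter Kannala-Brandt model (focal lengths fx, fy, principal point cx, cy, and four distortion coefficients k1..k4) projects a 3D point by distorting the angle θ between the viewing ray and the optical axis. A minimal sketch; the parameter values in the example are made up, not our calibration:

```python
import numpy as np

def project_kb(point, fx, fy, cx, cy, k1, k2, k3, k4):
    """Project a 3D point with the 8-parameter Kannala-Brandt fisheye model.

    The radial mapping is d(theta) = theta + k1*theta^3 + k2*theta^5
    + k3*theta^7 + k4*theta^9, where theta is the angle between the
    viewing ray and the optical axis.
    """
    x, y, z = point
    r = np.hypot(x, y)
    theta = np.arctan2(r, z)
    d = theta * (1.0 + k1 * theta**2 + k2 * theta**4
                 + k3 * theta**6 + k4 * theta**8)
    if r > 0.0:
        mx, my = d * x / r, d * y / r
    else:                      # point on the optical axis
        mx = my = 0.0
    return fx * mx + cx, fy * my + cy


# A point on the optical axis projects to the principal point:
u, v = project_kb((0.0, 0.0, 1.0), 400.0, 400.0, 400.0, 200.0, 0, 0, 0, 0)
```

With all distortion coefficients set to zero, the model reduces to an equidistant fisheye projection (pixel radius proportional to θ).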

3.2.3 GNSS Antenna

Since the GNSS antenna has an isotropic reception pattern, no orientation needs to be calibrated; only the 3D translation vector between one of the cameras and the antenna, expressed in the camera frame, has to be known. This vector was measured manually for our sensor setup.

3.3 Ground Truth Generation

Reference poses (i.e. ground truth) for VO and SLAM should provide high accuracy in both local relative 6DoF transformations and global positioning. To fulfill the first requirement, we extend the state-of-the-art stereo direct sparse VO (Wang et al, 2017a) by integrating IMU measurements (Von Stumberg et al, 2018), obtaining a stereo-inertial SLAM system with an average tracking drift of around 0.6% of the traveled distance.

To fulfill the second requirement, the poses estimated by our stereo-inertial system are fused with the RTK GNSS measurements using a global pose graph. We first estimate a $\mathrm{Sim}(3)$ transformation to globally align the camera positions in the VIO coordinate system with those in the GNSS coordinate system, using the Kabsch–Umeyama algorithm (Umeyama, 1991). A transformation in $\mathrm{Sim}(3)$ is estimated instead of one in $\mathrm{SE}(3)$ to account for the global scale drift of the VIO system. Denoting the Lie algebra of $\mathrm{SE}(3)$ as $\mathfrak{se}(3)$, each aligned camera pose $\boldsymbol{\xi}^{\text{VIO}}_{wi} \in \mathfrak{se}(3)$ is added to the pose graph as an $\mathfrak{se}(3)$ node, where $\boldsymbol{\xi}_{wi}$ defines the transformation from the $i$-th camera coordinate system to the world coordinate system. The camera connections from the VIO sliding window (one connection corresponds to two cameras co-observing a part of the scene) are added as $\mathfrak{se}(3)$-$\mathfrak{se}(3)$ edges, with the relative poses $\boldsymbol{\xi}^{\text{VIO}}_{ji}$ as measurements.
If a camera pose has a valid corresponding GNSS pose, that is, the GNSS pose is available and the observed standard deviation of the position is smaller than a predefined threshold, the GNSS position $\mathbf{t}_{i} \in \mathbb{R}^{3}$ is added to the pose graph as a fixed $\mathbb{R}^{3}$ node and an $\mathfrak{se}(3)$-$\mathbb{R}^{3}$ edge is added. The energy function for the pose graph optimization is thus defined as:

$$E(\boldsymbol{\xi}_{wi},\dots,\boldsymbol{\xi}_{wn}) = \sum_{\boldsymbol{\xi}^{\text{VIO}}_{ji} \in \varepsilon} \left(\boldsymbol{\xi}^{\text{VIO}}_{ji} \circ \boldsymbol{\xi}^{-1}_{wi} \circ \boldsymbol{\xi}_{wj}\right)^{\top} \mathbf{\Sigma}^{-1}_{ji} \left(\boldsymbol{\xi}^{\text{VIO}}_{ji} \circ \boldsymbol{\xi}^{-1}_{wi} \circ \boldsymbol{\xi}_{wj}\right) + \omega \sum_{\mathbf{t}_{i} \in \nu} \left(\mathbf{t}_{i} - (\boldsymbol{\xi}_{wi} \circ \boldsymbol{\xi}_{cg})^{[\mathbf{t}]}\right)^{\top} \mathbf{\Sigma}^{-1}_{i} \left(\mathbf{t}_{i} - (\boldsymbol{\xi}_{wi} \circ \boldsymbol{\xi}_{cg})^{[\mathbf{t}]}\right) \qquad (1)$$

where $\varepsilon$ is the set of VIO camera connections and $\nu$ is the set of valid RTK GNSS poses. $\mathbf{\Sigma}_{ji} \in \mathbb{R}^{6 \times 6}$ and $\mathbf{\Sigma}_{i} \in \mathbb{R}^{3 \times 3}$ are the covariance matrices from the VIO and GNSS systems, respectively. $\boldsymbol{\xi}_{cg}$ denotes the extrinsic calibration between the camera and the GNSS antenna. A scale term $\omega$ is added to balance the two different domains. The $\circ$-operator denotes the concatenation of poses in $\mathfrak{se}(3)$ and is defined as follows:

$$\boldsymbol{\xi}_{i} \circ \boldsymbol{\xi}_{j} := \log\!\left(\exp(\boldsymbol{\xi}_{i}) \cdot \exp(\boldsymbol{\xi}_{j})\right), \qquad (2)$$

where $\log(\cdot)$ denotes the logarithm map and $\exp(\cdot)$ the exponential map of the $\mathrm{SE}(3)$ Lie algebra, and $\boldsymbol{\xi}^{[\mathbf{t}]}$ denotes the translation part of $\boldsymbol{\xi} \in \mathfrak{se}(3)$. The energy function is optimized using the Levenberg–Marquardt algorithm as implemented in (Kümmerle et al, 2011).
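The $\circ$-operator of Eq. (2) can be prototyped quite literally with matrix exponentials and logarithms of 4×4 twist matrices. The sketch below is illustrative only; the `hat`/`vee` helpers and the (translation, rotation) ordering of the twist vector are our own conventions, not the paper's:

```python
import numpy as np
from scipy.linalg import expm, logm

def hat(xi):
    """Map a twist xi = (v, w) to its 4x4 matrix representation in se(3)."""
    v, w = xi[:3], xi[3:]
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])   # skew-symmetric rotation part
    T = np.zeros((4, 4))
    T[:3, :3] = W
    T[:3, 3] = v
    return T

def vee(T):
    """Inverse of hat: extract the 6-vector twist from a 4x4 se(3) matrix."""
    return np.array([T[0, 3], T[1, 3], T[2, 3],
                     T[2, 1], T[0, 2], T[1, 0]])

def compose(xi_i, xi_j):
    """Eq. (2): xi_i ∘ xi_j := log(exp(xi_i) · exp(xi_j))."""
    M = logm(expm(hat(xi_i)) @ expm(hat(xi_j)))
    return vee(np.real(M))
```

A production pose graph would use closed-form $\mathrm{SE}(3)$ exp/log maps instead of the dense `expm`/`logm`; the dense version is simply the most direct transcription of Eq. (2).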

One crucial aspect of the dataset is that the reference poses we provide are accurate enough, even though some recorded sequences contain challenging conditions in partially GNSS-denied environments. Although the stereo-inertial sensor system has an average drift of around 0.6%, this cannot be guaranteed for all cases. Hence, for the reference poses in our dataset, we report whether a pose can be considered reliable by measuring the distance to the corresponding RTK GNSS measurement. For all poses without a corresponding RTK GNSS measurement, we do not guarantee a certain accuracy. Nevertheless, due to the highly accurate stereo-inertial odometry system, these poses can be considered accurate in most cases, even in environments without GNSS, e.g. tunnels, or areas with tall buildings. We provide details about the pose accuracy in Section 4.2.1.
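The global $\mathrm{Sim}(3)$ alignment that initializes the pose graph of Section 3.3 can be sketched with a closed-form Kabsch–Umeyama solve. This is a generic textbook implementation of (Umeyama, 1991), not the authors' code; the function and variable names are ours:

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares Sim(3): find scale s, rotation R, translation t
    minimizing ||dst - (s * R @ src + t)|| over corresponding points.

    src, dst: (N, 3) arrays, e.g. VIO positions and GNSS positions.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # enforce a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t


# Recover a known similarity transform from noise-free correspondences:
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
th = 0.5
Rz = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th), np.cos(th), 0.0],
               [0.0, 0.0, 1.0]])
dst = 2.0 * src @ Rz.T + np.array([1.0, -2.0, 3.0])
s, R, t = umeyama_alignment(src, dst)
```

Applying the recovered (s, R, t) to the VIO trajectory brings it into the GNSS frame before the pose graph of Eq. (1) refines the individual poses.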

4 Benchmark Setup

To overcome the shortcomings of existing benchmarks and datasets for autonomous driving, as discussed in Section2, we define the following requirements for an appropriate benchmark.

  • Accuracy: we provide up to centimeter-accurate 6DoF poses obtained by fusing VIO measurements with RTK GNSS correction data.
  • Large-scale: we provide large-scale sequences (trajectories longer than 10 km) to allow for extensive evaluations of SLAM and visual localization under challenging conditions.
  • Diversity: besides large-scale, we also provide both short-term and long-term changes within the recorded scenes. This is important to evaluate the generalization capabilities of recent learning-based methods.
  • Multitask: the benchmark can be used to evaluate visual odometry, global place recognition, and map-based visual localization under challenging conditions.

Based on these properties, we propose a novel large-scale dataset that is used as an extensive benchmark suite for evaluating multitasking challenges related to autonomous driving under changing conditions. The sequences have been collected in the metropolitan area of Munich, Germany. The different scenes are described in the next section.

Table 1: Statistics of the 4Seasons benchmark. This table shows the different scenarios and recordings along with the weather conditions, seasons, and time of day of our benchmark. We provide a variety of scenarios and short-term to long-term changes. The recordings in this table are used for the benchmark evaluation. The ground truth (GNSS/IMU, point clouds, and reference poses) is withheld. The benchmark type (VO = visual odometry, GPR = global place recognition, MBVL = map-based visual localization) defines the benchmark a sequence is used for. All recordings with ground truth are shown in Table 5.

| Scenario | Recording | Weather (cloudy, rainy, snowy, sunny) | Season (winter, spring, summer, fall) | Daytime (morning, afternoon, evening, night) | Benchmark Type | Map Accuracy: Horizontal RMSE (GNSS-Ref. Pose) | Map Accuracy: % of Accurate Poses |
|---|---|---|---|---|---|---|---|
| office_loop_1_test | 2020-03-03_12-12-32 | cloudy | spring | afternoon | GPR | 12.29 cm | 59.91% |
| office_loop_2_test | 2020-03-26_15-03-02 | cloudy/sunny | spring | afternoon | VO | 5.14 cm | 90.22% |
| office_loop_3_test | 2021-05-10_19-25-54 | cloudy | spring | evening | VO + MBVL | 5.78 cm | 92.06% |
| highway_1_test | 2020-10-08_10-19-46 | sunny | fall | morning | VO | 8.04 cm | 73.65% |
| highway_2_test | 2021-02-25_13-11-30 | sunny | winter | afternoon | VO | 4.80 cm | 74.31% |
| neighborhood_1_test | 2020-03-26_14-54-05 | cloudy | spring | afternoon | GPR | 2.20 cm | 87.38% |
| neighborhood_2_test | 2021-05-10_18-26-26 | cloudy | spring | evening | VO + MBVL | 1.51 cm | 87.42% |
| business_campus_1_test | 2021-01-07_13-03-56 | cloudy/snowy | winter | afternoon | VO + MBVL | 3.39 cm | 97.36% |
| countryside_1_test | 2020-03-26_14-30-52 | cloudy | spring | afternoon | GPR | 2.53 cm | 91.75% |
| countryside_2_test | 2021-01-07_14-03-57 | cloudy/snowy | winter | afternoon | VO + MBVL | 2.36 cm | 92.21% |
| city_loop_1_test | 2020-03-03_12-28-45 | cloudy | spring | afternoon | GPR | 5.36 cm | 83.62% |
| city_loop_2_test | 2021-02-25_11-27-40 | sunny | winter | morning | VO + MBVL | 3.36 cm | 81.40% |
| old_town_1_test | 2020-10-08_12-11-19 | cloudy | fall | afternoon | GPR | 7.19 cm | 94.26% |
| old_town_2_test | 2021-05-10_19-51-14 | cloudy | spring | evening | VO | 1.84 cm | 96.04% |
| old_town_3_test | 2021-05-10_21-18-00 | cloudy | spring | night | VO + MBVL | 4.94 cm | 92.07% |
| maximilianeum_1_test | 2021-02-25_12-16-32 | sunny | winter | afternoon | VO | 1.90 cm | 80.13% |
| maximilianeum_2_test | 2021-05-10_20-59-00 | cloudy | spring | night | VO | 12.46 cm | 76.46% |
| parking_garage_1_test | 2020-06-12_10-29-20 | sunny | summer | morning | VO + MBVL | 0.75 cm | 35.06% |
| parking_garage_2_test | 2021-05-10_19-18-36 | cloudy | spring | evening | GPR | 4.54 cm | 40.75% |


Figure 3: Data collection map. This figure shows the map of the area covered by our benchmark dataset. We provide large-scale sequences across a wide variety of environments. A detailed visualization of each scenario's trajectory is shown in Figure 15.

4.1 Scenarios

This section describes the different sequences we have collected for the dataset. The sequences cover different scenarios, ranging from urban driving to a parking garage and rural areas. We provide complex trajectories, which include partially overlapping routes and multiple loops within a sequence. For each scenario, we have collected multiple traversals covering a wide range of variations in structure and environmental appearance due to weather, illumination, dynamic objects, and seasonal effects. In total, our benchmark dataset consists of nine different scenarios.

Figure 3 shows the covered area, including highlighted traces. Each scenario is visualized in a separate color. We now describe each scene in more detail.

  1. Office Loop. A loop around an industrial area of the city.
  2. Highway. A drive along the three-lane A9 highway in the northern part of Munich.
  3. Neighborhood. Traversal through a neighborhood at the outskirts of the city, covering detached houses with gardens and trees in the street.
  4. Business Campus. Several loops around a campus in a business area.
  5. Countryside. Rural area around agricultural fields that exhibits very homogeneous and repetitive structures.
  6. City Loop. A large-scale loop at a ring road within the city of Munich, including a tunnel.
  7. Old Town. Loop around the urban city center with tall buildings, much traffic, and dynamic objects.
  8. Maximilianeum. The Maximilianeum is a famous palatial building in Munich which is located at the eastern end of a royal avenue with paving stones and a tram route.
  9. Parking Garage. A three-level parking garage to benchmark combined indoor and outdoor environments.

The VIO traces for each scenario are shown in Figure 15. We provide reference poses and 3D models as sparse point clouds generated by our ground truth generation pipeline (cf. Figure 4), along with the corresponding raw image frames and raw IMU measurements. Figure 5 shows an example of the optimized trajectory, which illustrates the accuracy of the provided reference poses. Table 1 lists all the sequences with withheld ground truth used for benchmarking.

The benchmark dataset presents a challenge to current approaches to visual SLAM and long-term localization because it contains data from different seasons and weather conditions, as well as from different times of day, as shown in Figure 14.


Figure 4: 3D models of different scenarios contained in the dataset. The figure shows an office loop around an industrial area (left), multiple loops around a business campus with high buildings (middle), and a stretch recorded in a multi-level parking garage (right). The green lines encode the GNSS trajectories, and the red lines encode the VIO trajectories. Top: shows the trajectories before the fusion using pose graph optimization. Bottom: shows the results after the pose graph optimization. Note that after the pose graph optimization, the reference trajectories are well aligned.


Figure 5: Reference poses validation. This figure shows two additional 3D models of the collected scenarios. Note that these two sequences are quite large (more than 10 km and 6 km, respectively). Top: before the fusion using pose graph optimization. Bottom: results after optimization. The green lines encode the GNSS trajectories, the red lines show the VIO trajectories (before fusion) and the fused trajectories (after fusion). The left part of the figure shows a zoomed-in view of a tunnel, where the GNSS signal becomes very noisy, as highlighted in the red boxes. Moreover, due to the large size of the sequence, the accumulated tracking error leads to a significant deviation of the VIO trajectory from the GNSS recordings. Our pose graph optimization, which relies globally on GNSS positions and locally on VIO relative poses, successfully eliminates both global VIO drift and local GNSS positioning flaws.

4.2 Reference Pose Validation

The top part of Figure 1 shows two overlaid point clouds from different runs across the same scene. Note that despite the weather and seasonal differences, the point clouds align very well. This shows that our reference poses are sufficiently accurate for benchmarking long-term localization. Furthermore, a qualitative assessment of the point-to-point correspondences is shown in Figure 6. The figure shows a subset of very accurate pixel-wise correspondences across different seasons (fall/winter) in the top row and different illumination conditions (sunny/night) in the bottom row. These point-to-point correspondences are a result of our up to centimeter-accurate global reference poses, which makes them suitable as training pairs for learning-based algorithms. Recently, there has been an increasing demand for pixel-wise cross-season correspondences, which are needed to learn dense feature descriptors (Spencer et al, 2020; Dusmanu et al, 2019; Revaud et al, 2019b). However, there is still a lack of datasets to satisfy this demand. The KITTI dataset (Geiger et al, 2013) does not provide cross-season data. The Oxford RobotCar Dataset (Maddern et al, 2017) provides cross-season data; however, since its ground truth is not accurate enough, the authors do not recommend it for benchmarking localization and mapping approaches.


Figure 6: Accurate pixel-wise correspondences, making cross-season training possible. Qualitative assessment of the accuracy of our data collection and geometric reconstruction method for a sample of four different conditions (from top left in clockwise order: cloudy, snowy, night, sunny) across the same scene. Each same colored point in the four images corresponds to the same geometric point in the world. The cameras corresponding to these images have different poses in the global frame of reference. Please note that the points are not matched, but rather a result of our accurate reference poses and geometric reconstruction.

Recently, RobotCar Seasons (Sattler et al, 2018) was proposed to overcome the inaccuracy of the provided ground truth. However, like the authors of (Spencer et al, 2020), we found that it is still challenging to obtain accurate cross-season pixel-wise matches due to pose inconsistencies. Furthermore, this dataset only provides images captured from three synchronized cameras mounted on a car, pointing to the rear-left, rear, and rear-right, respectively. Another limitation is that it only provides relatively small segments and no long trajectories, and a significant portion of it suffers from strong motion blur and low image quality.

4.2.1 Pose Accuracy

One potential limitation of our benchmark dataset is that we can only guarantee a certain pose accuracy when GNSS is available. Naturally, GNSS is unreliable in urban canyons or tunnels. Therefore, for the benchmark evaluation, we only consider poses as reference poses if GNSS is available and the observed standard deviation of the position is less than 5 cm. Please note that we only require accurate reference poses for the evaluation of visual localization. The evaluation of VO is based on the accumulated drift over time, i.e. it is only required that the start and end positions of each segment of a sequence are accurate. Furthermore, we provide quantitative measures of the quality of the maps: we report the percentage of accurate reference poses for each trajectory, as well as the overall map accuracy in terms of the horizontal RMSE between the GNSS poses and the refined poses after pose graph optimization.

The percentage of accurate poses can be found in Table 1 for each test sequence and in Table 5 for the training sequences. For a qualitative visual analysis, we show accurate pixel-wise correspondences in Figure 6, indicating that the reference poses are sufficiently accurate. We do not claim that our poses are consistently centimeter-accurate; however, by analyzing the map accuracy we can assure the quality of the poses used for benchmarking.
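The two map-quality measures described above can be computed in a few lines. The following sketch is our own illustration, not the benchmark's official code; the function name, argument layout, and the 5 cm default are labeled assumptions (the threshold matches the one stated earlier in this section).

```python
import numpy as np

def map_quality(gnss_xy, refined_xy, gnss_std, std_thresh=0.05):
    """Illustrative sketch of the map-quality measures described above.

    gnss_xy, refined_xy: (N, 2) horizontal positions in meters.
    gnss_std: (N,) reported standard deviation of each GNSS fix in meters.
    """
    # A pose counts as an accurate reference pose only when the reported
    # GNSS standard deviation is below the threshold (5 cm in the paper).
    accurate = gnss_std < std_thresh
    pct_accurate = 100.0 * np.count_nonzero(accurate) / len(gnss_std)

    # Horizontal RMSE between GNSS fixes and the optimized reference poses.
    diff = gnss_xy[accurate] - refined_xy[accurate]
    rmse = np.sqrt(np.mean(np.sum(diff**2, axis=1)))
    return pct_accurate, rmse
```
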

4.3 Data Source

We release (distorted & undistorted) 8-bit grayscale images, IMU measurements, and sensor calibration, including the calibration sequences, for all sequences (training and testing). In addition, RTK GNSS measurements, in NMEA format, VO point clouds, and reference poses are released only for training sequences. For the testing sequences, such data is withheld for evaluation. Moreover, we specify the distance between the refined reference poses and the raw RTK GNSS measurements.

5 Benchmark Tasks

In this section, we define the benchmark evaluation metrics, tasks, and their evaluation protocols for visual odometry, global place recognition, and map-based visual localization. Visual localization consists of retrieving the 6DoF pose of a query within an existing 3D model and can be interpreted as a two-step approach. First, global image retrieval is performed to obtain a rough estimate of the query pose w.r.t. a map. Second, local feature matching is used to refine the pose estimate.

For the evaluation, in each task we consider a set of estimated 6DoF poses $\mathbf{T}_{i}^{\text{est}} \in \mathrm{SE}(3)$, as well as a set of reference poses $\mathbf{T}_{i}^{\text{ref}} \in \mathrm{SE}(3)$. While the reference poses are always defined w.r.t. a global world frame, the estimated poses are defined either w.r.t. the same global world frame (for global place recognition and map-based visual localization) or w.r.t. a selected local frame, e.g. the camera frame of the first recorded left camera image (visual odometry).

5.1 Visual Odometry in Challenging Conditions

Visual odometry (VO) aims to accurately estimate the relative 6DoF camera pose from recorded images. Various datasets already exist for benchmarking VO (Geiger et al, 2012; Sturm et al, 2012; Engel et al, 2016). However, all of these datasets consist of sequences recorded under rather homogeneous conditions (indoors, or sunny/overcast outdoor conditions). Methods developed for autonomous driving use cases must perform robustly under almost any condition. We believe that the proposed benchmark will contribute to improving the performance of VO under diverse weather and lighting conditions in an automotive environment. Therefore, instead of replacing existing benchmarks and datasets, we aim to provide an extension that is more focused on challenging conditions in autonomous driving. As we provide frame-wise accurate poses for large portions of the sequences, metrics well known from other benchmarks, such as the absolute trajectory error (ATE) and the relative pose error (RPE) (Geiger et al, 2012; Sturm et al, 2012), are also applicable to our data.

5.1.1 Evaluation Metrics

Similar to previous benchmarks, the main accuracy measure we are interested in is the RPE. In general, the RPE is split up into a translational and a rotational error. However, another component we are interested in is the scale error. One may argue that, especially for stereo approaches, scale errors are marginal and therefore not relevant. Nevertheless, our experience is different. We observe that quite significant scale errors and drift can occur when performing stereo VO and SLAM in automotive environments. This can be caused by the miscalibration of the cameras, by the structure of the scene, or by algorithm-specific design choices such as the type of keypoint detector. Since the sensor setup has a limited stereo baseline, parallaxes (i.e. pixel disparities) for far object points vanish. This means that, even for stereo approaches, the scale becomes non-observable if no close static objects are present in the scene. Increasing the stereo baseline, however, could reduce the rigidity of the sensor setup. We believe that it is very valuable to conduct further research on stereo VO and SLAM methods which explicitly consider the depth uncertainties created by the length of the stereo baseline.

Since in automotive use cases the scale can always be observed from a reference system such as wheel ticks, GNSS, or a reference map, we consider only relative errors (drifts) in scale, translation, and rotation in the proposed benchmark. Therefore, before evaluation, a global scale alignment of the entire estimated trajectory with respect to the reference trajectory is performed.
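A minimal sketch of such a global scale alignment is shown below. It uses a least-squares scale over associated, mean-centered positions; the function name and the exact estimator are our illustrative assumptions, not necessarily the benchmark's official alignment procedure.

```python
import numpy as np

def align_scale(est_xyz, ref_xyz):
    """Globally align the scale of an estimated trajectory to the reference.

    est_xyz, ref_xyz: (N, 3) associated positions.
    Returns the scaled trajectory and the scale factor s that minimizes
    ||s * est_centered - ref_centered||^2.
    """
    est_c = est_xyz - est_xyz.mean(axis=0)  # remove translation offset
    ref_c = ref_xyz - ref_xyz.mean(axis=0)
    s = np.sum(est_c * ref_c) / np.sum(est_c * est_c)
    return s * est_xyz, s
```
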

For the proposed VO benchmark, all evaluation metrics are defined based on the estimated relative pose $\mathbf{T}_{ij}^{\text{est}} \in \mathrm{SE}(3)$ between two frames $i$ and $j$ and its corresponding reference pose $\mathbf{T}_{ij}^{\text{ref}} \in \mathrm{SE}(3)$ with:

$$\mathbf{T}_{ij}^{\text{ref}} = \left(\mathbf{T}_{i}^{\text{ref}}\right)^{-1}\mathbf{T}_{j}^{\text{ref}} \quad\text{and}\quad \mathbf{T}_{ij}^{\text{est}} = \left(\mathbf{T}_{i}^{\text{est}}\right)^{-1}\mathbf{T}_{j}^{\text{est}}. \tag{3}$$

For a pair of frames $(i, j)$ for which reference poses are available, we calculate the relative translational error $\epsilon_{ij}^{t}$, rotational error $\epsilon_{ij}^{r}$, and scale error $\tilde{\epsilon}_{ij}^{s}$ as given in Equations (4) to (6):

$$\epsilon_{ij}^{t} = \frac{\left\|\mathbf{t}_{ij}^{\text{ref}} - \mathbf{t}_{ij}^{\text{est}}\right\|_{2}}{d_{ij}} \tag{4}$$

$$\epsilon_{ij}^{r} = \frac{\arccos\left(\frac{1}{2}\left(\operatorname{trace}\left(\left(\mathbf{R}_{ij}^{\text{ref}}\right)^{-1}\mathbf{R}_{ij}^{\text{est}}\right) - 1\right)\right)}{d_{ij}} \tag{5}$$

$$\tilde{\epsilon}_{ij}^{s} = \frac{\left\|\mathbf{t}_{ij}^{\text{est}}\right\|_{2}}{\left\|\mathbf{t}_{ij}^{\text{ref}}\right\|_{2}} \tag{6}$$

From $\tilde{\epsilon}_{ij}^{s}$ one obtains the final relative scale error as $\epsilon_{ij}^{s} = \max\left[\tilde{\epsilon}_{ij}^{s}, \left(\tilde{\epsilon}_{ij}^{s}\right)^{-1}\right]$. The parameter $d_{ij}$ denotes the reference path length between the two poses $\mathbf{T}_{i}^{\text{ref}}$ and $\mathbf{T}_{j}^{\text{ref}}$.
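The three errors of Equations (4) to (6) can be sketched directly from 4x4 homogeneous pose matrices. The function name is illustrative; note that the rotation angle is converted to degrees here because the precision thresholds later in this section are stated in deg/m, which is our assumption about the intended unit.

```python
import numpy as np

def relative_pose_errors(T_ref_i, T_ref_j, T_est_i, T_est_j, d_ij):
    """Relative translational, rotational, and scale errors, Eqs. (3)-(6).

    All poses are 4x4 homogeneous SE(3) matrices; d_ij is the reference
    path length (in meters) between frames i and j.
    """
    # Eq. (3): relative poses between frames i and j.
    T_ref = np.linalg.inv(T_ref_i) @ T_ref_j
    T_est = np.linalg.inv(T_est_i) @ T_est_j
    t_ref, t_est = T_ref[:3, 3], T_est[:3, 3]
    R_rel = T_ref[:3, :3].T @ T_est[:3, :3]

    eps_t = np.linalg.norm(t_ref - t_est) / d_ij                  # Eq. (4)
    cos_a = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    eps_r = np.degrees(np.arccos(cos_a)) / d_ij                   # Eq. (5)
    s = np.linalg.norm(t_est) / np.linalg.norm(t_ref)             # Eq. (6)
    eps_s = max(s, 1.0 / s)                                       # final scale error
    return eps_t, eps_r, eps_s
```
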

Meaningful metrics are obtained by extracting all possible sub-segments of length 100 m, 200 m, 400 m, 600 m, 800 m, and 1000 m from a trajectory and calculating the relative poses between the first and last frame of each sub-segment. Furthermore, for trajectory segments where no GNSS measurements are available for more than 1000 m (e.g. in tunnels, garages, or urban canyons), the relative pose across the entire stretch is also taken into account. This allows us to include challenging scenarios such as tunnels and the transition from bright to dark in the benchmark. Using sub-segments of different lengths for evaluation is inspired by the KITTI benchmark (Geiger et al, 2012) and captures both the short-term and long-term accuracy of VO algorithms.
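The sub-segment extraction above can be sketched as follows, assuming reference positions ordered by time; the function name and return layout are illustrative, and "all possible sub-segments" is interpreted here as one segment per start frame and target length.

```python
import numpy as np

def segment_endpoints(positions, lengths=(100, 200, 400, 600, 800, 1000)):
    """Find (first, last, length) index tuples for all sub-segments whose
    reference path length reaches each target length (illustrative sketch).

    positions: (N, 3) reference positions ordered by time.
    """
    # Cumulative path length along the reference trajectory.
    step = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    cumdist = np.concatenate([[0.0], np.cumsum(step)])

    pairs = []
    for L in lengths:
        for i in range(len(positions)):
            # First frame j whose accumulated distance from frame i reaches L.
            j = np.searchsorted(cumdist, cumdist[i] + L)
            if j < len(positions):
                pairs.append((i, j, L))
    return pairs
```

The relative pose between the first and last frame of each returned pair is then compared against the reference, as in Equations (4) to (6).
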

To obtain single-number metrics for every sequence, we consider the VO successful if the errors are within certain translational, rotational, and scale bounds. We define three precision intervals by varying the thresholds: high precision (0.5 %, 0.005 deg/m, 1.005 scale multiplier), medium precision (1 %, 0.01 deg/m, 1.01), and coarse precision (2 %, 0.02 deg/m, 1.02).
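As a sketch, a single segment can be assigned to one of the three precision classes like this; the function name and the "failed" fallback label are illustrative, and the benchmark then reports the fraction of segments per class.

```python
def vo_precision(eps_t, eps_r, eps_s):
    """Classify one segment into the precision classes listed above.

    eps_t: relative translational drift as a fraction (0.008 = 0.8 %),
    eps_r: rotational drift in deg/m,
    eps_s: final scale error (a multiplier >= 1).
    """
    classes = [("high",   0.005, 0.005, 1.005),
               ("medium", 0.01,  0.01,  1.01),
               ("coarse", 0.02,  0.02,  1.02)]
    for name, t_max, r_max, s_max in classes:
        if eps_t <= t_max and eps_r <= r_max and eps_s <= s_max:
            return name
    return "failed"
```
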

While the translational error is the most meaningful metric for evaluating VO algorithms, the rotational and scale errors still give valuable insight into the specific behavior of an approach.

5.2 Global Place Recognition

Global place recognition refers to the task of retrieving the most similar database image given a query image (Lowry et al, 2015). To improve search efficiency and robustness against different weather conditions, tremendous progress has been made on global descriptors (Jégou et al, 2010; Arandjelovic and Zisserman, 2013; Angeli et al, 2008; Gálvez-López and Tardos, 2012). In the localization pipeline, visual place recognition serves as the initialization step for the downstream local pose refinement by providing the most similar database images as well as the corresponding global poses. With the advent of deep neural networks (Simonyan and Zisserman, 2015; Krizhevsky et al, 2012; He et al, 2016; Szegedy et al, 2015), methods aggregating deep image features have been proposed and have shown advantages over classical methods (Arandjelovic et al, 2016; Gordo et al, 2016; Radenović et al, 2018; Tolias et al, 2015).


Figure 7: Challenging scenes for global place recognition. Top: two pictures share the same location with different appearances. Bottom: two pictures have a similar appearance but are taken at different locations.

The proposed dataset is challenging for global place recognition since it contains not only cross-season images with different appearances at similar geographical locations, but also intra-season images that share similar appearances at different locations. This results in two main failure modes: images taken at the same place that look different, and images taken at different places that look similar. Figure 7 depicts example pairs of both scenarios.

5.2.1 Evaluation Metrics

We follow the standard metric widely used for global place recognition (Arandjelovic et al, 2016; Arandjelovic and Zisserman, 2013; Sattler et al, 2012; Gordo et al, 2016), namely the recall at the top $N$ retrievals with a certain range bound as the positive threshold. Specifically, a query image is considered correctly localized if at least one of the top $N$ retrieved images is within a certain translational (in meters) and rotational (in degrees) bound with respect to the ground-truth location of the query image. The translational error $\epsilon^{t}$ is measured as the Euclidean distance

$$\epsilon^{t} = \left\|\mathbf{t}^{\text{ref}} - \mathbf{t}^{\text{est}}\right\|_{2} \tag{7}$$

between the reference camera position $\mathbf{t}^{\text{ref}}$ and the estimated camera position $\mathbf{t}^{\text{est}}$. The rotational error $\epsilon^{r}$ is measured as an angle in degrees (following (Hartley et al, 2013)) by calculating:

$$\epsilon^{r} = \arccos\left(\frac{1}{2}\left(\operatorname{trace}\left(\left(\mathbf{R}^{\text{ref}}\right)^{-1}\mathbf{R}^{\text{est}}\right) - 1\right)\right), \tag{8}$$

where $\mathbf{R}^{\text{ref}}$ and $\mathbf{R}^{\text{est}}$ denote the reference and estimated camera rotation matrices. In the evaluation of global place recognition, we calculate the recall under different threshold settings: by fixing $N$ and changing the range bound, or by fixing the range bound and changing $N$. We describe the specific settings in Section 6.2.
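The recall-at-top-$N$ metric with the error definitions of Equations (7) and (8) can be sketched as follows; the function name and input layout (precomputed retrieval rankings and pose tuples) are illustrative assumptions.

```python
import numpy as np

def recall_at_n(retrievals, query_poses, db_poses, n, t_max, r_max):
    """Fraction of queries with at least one correct match in the top n.

    retrievals: (Q, K) database indices ranked by descriptor similarity;
    query_poses, db_poses: lists of (t, R) with t a 3-vector and R a
    3x3 rotation matrix; t_max in meters, r_max in degrees.
    """
    hits = 0
    for q, ranked in enumerate(retrievals):
        t_q, R_q = query_poses[q]
        for d in ranked[:n]:
            t_d, R_d = db_poses[d]
            eps_t = np.linalg.norm(t_q - t_d)                         # Eq. (7)
            cos_a = np.clip((np.trace(R_q.T @ R_d) - 1.0) / 2.0, -1.0, 1.0)
            eps_r = np.degrees(np.arccos(cos_a))                      # Eq. (8)
            if eps_t <= t_max and eps_r <= r_max:
                hits += 1
                break  # one correct retrieval is enough for this query
    return hits / len(retrievals)
```
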

5.3 Map-Based Visual Localization

Map-based visual localization refers to the task of locally refining the 6DoF pose between reference images and images from a query sequence. In contrast to wide-baseline stereo matching, for map-based visual localization, it is also possible to utilize the sequential information of the sequence. This allows estimating depth values by running a standard VO method. Those depth estimates can then be used to improve the tracking of the individual localization candidates.

In contrast to global place recognition, which only uses 2D images and no other information, this task allows the use of a globally consistent 3D reconstruction of the reference scene. In this task, we assume the mapping between reference and query samples to be known and focus only on the local pose refinement. In practice, this mapping can be found using image retrieval techniques as described in Section 5.2, or by using GNSS measurements as a coarse initialization if available.

Accurately localizing within a pre-built map is a challenging problem, especially if the visual appearance of the query sequence differs significantly from the base map. This is particularly difficult for vision-based systems, since the localization accuracy is often limited by the discriminative power of the feature descriptors. Our proposed dataset allows evaluating visual localization across multiple weather conditions and diverse scenes, ranging from urban to countryside driving. Furthermore, our up to centimeter-accurate reference poses allow us to create stricter evaluation settings with an increased level of difficulty, making it possible to determine the limitations and robustness of current state-of-the-art methods.

5.3.1 Evaluation Metrics

For evaluation, we measure the translational and rotational errors between the estimated and the reference pose. Please refer to Equations (7) and (8) for their respective definitions.

We consider the localization successful if a query image is localized within certain positional (in meters) and rotational (in degrees) bounds with respect to its reference pose. We define three localization intervals by varying the thresholds: high precision (0.1 m, 1°), medium precision (0.25 m, 2°), and coarse precision (1 m, 5°).
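Given per-query translational and rotational errors, the three precision intervals can be evaluated in bulk as sketched below; the function name and the returned dictionary layout are illustrative.

```python
import numpy as np

def localization_recall(t_errs, r_errs):
    """Percentage of queries within each precision bound described above:
    high (0.1 m, 1 deg), medium (0.25 m, 2 deg), coarse (1 m, 5 deg).

    t_errs in meters, r_errs in degrees, aligned per query.
    """
    t_errs, r_errs = np.asarray(t_errs), np.asarray(r_errs)
    bounds = [("high", 0.1, 1.0), ("medium", 0.25, 2.0), ("coarse", 1.0, 5.0)]
    # A query counts toward a bucket only if BOTH bounds are satisfied.
    return {name: 100.0 * np.mean((t_errs <= t) & (r_errs <= r))
            for name, t, r in bounds}
```
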

6 Experimental Evaluation

In this section, we evaluate current state-of-the-art baseline methods for each of the three provided benchmarks (visual odometry, global place recognition, and map-based visual localization) to demonstrate the diversity and challenges of the benchmark. Upon publication, we will establish an open leaderboard for comparing different methods, which allows every user to reproduce the baseline results. Furthermore, we will set up a server for automatic evaluation of results on the withheld test set.

6.1 Visual Odometry in Challenging Conditions

We provide results for state-of-the-art baseline stereo and stereo-inertial odometry and SLAM approaches. The methods provided as baselines are classical geometric approaches. Nevertheless, we strongly encourage researchers to also evaluate learning-based methods on our benchmark. In particular, we provide results for the following stereo and stereo-inertial VO methods: ORB-SLAM3 (https://github.com/UZ-SLAMLab/ORB_SLAM3) (Campos et al, 2020) and Basalt (https://gitlab.com/VladyslavUsenko/basalt) (Usenko et al, 2019).

Table 2: Visual odometry results on known scenarios from the 4Seasons benchmark. This table shows the evaluation results of state-of-the-art baseline methods on the VO benchmark. The best-performing results are in bold. The results are shown in terms of the percentage of high / medium / coarse precision.

| Method | office_loop_2_test | office_loop_3_test | neighborhood_2_test | business_campus_1_test | countryside_2_test | city_loop_2_test | old_town_2_test | old_town_3_test | parking_garage_1_test | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Basalt (Usenko et al, 2019) (stereo) | 9.1 / 65.7 / 96.7 | 6.3 / 53.0 / 94.4 | 4.2 / 21.5 / 70.1 | 2.3 / 28.5 / 71.5 | 7.7 / 38.3 / 77.6 | 11.4 / 43.8 / 72.3 | 5.6 / 31.1 / 78.0 | 1.3 / 9.0 / 37.4 | 0.0 / 0.0 / 33.3 | 5.3 / 32.3 / 70.2 |
| Basalt (Usenko et al, 2019) (stereo-inertial) | 3.3 / 35.0 / 92.0 | 2.1 / 20.9 / 80.8 | 3.5 / 23.6 / 72.9 | 11.3 / 59.0 / 95.7 | 16.4 / 48.5 / 88.8 | 23.1 / 59.2 / 88.2 | 0.0 / 0.0 / 0.0 | 1.5 / 15.4 / 42.6 | 0.0 / 11.1 / 55.6 | 6.8 / 30.3 / 68.5 |
| ORB-SLAM3 (Campos et al, 2020) (stereo) | 16.8 / 65.3 / 94.9 | 1.4 / 24.0 / 82.2 | 4.9 / 55.6 / 95.8 | 3.9 / 42.2 / 82.8 | 5.8 / 41.4 / 76.6 | 1.2 / 12.6 / 49.3 | 0.8 / 17.7 / 57.0 | 0.3 / 1.0 / 2.8 | 0.0 / 22.2 / 77.8 | 3.9 / 31.3 / 68.8 |
| ORB-SLAM3 (Campos et al, 2020) (stereo-inertial) | 7.3 / 33.6 / 84.7 | 2.1 / 15.7 / 50.2 | 13.2 / 44.4 / 84.0 | 19.9 / 64.8 / 91.0 | 2.9 / 11.4 / 42.6 | 26.1 / 59.2 / 77.9 | 12.9 / 43.3 / 87.1 | 0.0 / 0.0 / 0.0 | 0.0 / 11.1 / 44.4 | 9.4 / 31.5 / 62.4 |

Table 3: Visual odometry results on unknown scenarios from the 4Seasons benchmark. This table shows the evaluation results of state-of-the-art baseline methods on the VO benchmark. The best-performing results are in bold. The results are shown in terms of the percentage of high / medium / coarse precision.

| Method | highway_1_test | highway_2_test | maximilianeum_1_test | maximilianeum_2_test | Average |
| --- | --- | --- | --- | --- | --- |
| Basalt (Usenko et al, 2019) (stereo) | 9.4 / 32.0 / 63.1 | 10.3 / 29.5 / 52.7 | 35.1 / 75.3 / 91.8 | 1.2 / 9.8 / 38.2 | 14.0 / 36.6 / 61.4 |
| Basalt (Usenko et al, 2019) (stereo-inertial) | 32.3 / 68.6 / 85.8 | 21.3 / 49.2 / 71.2 | 34.0 / 69.6 / 94.3 | 0.0 / 6.9 / 27.2 | 21.9 / 48.6 / 69.6 |
| ORB-SLAM3 (Campos et al, 2020) (stereo) | 0.6 / 3.9 / 22.1 | 0.0 / 0.0 / 0.0 | 0.0 / 19.6 / 56.7 | 0.0 / 0.0 / 0.0 | 0.2 / 5.9 / 19.7 |
| ORB-SLAM3 (Campos et al, 2020) (stereo-inertial) | 10.3 / 29.6 / 48.6 | 12.9 / 25.1 / 68.3 | 26.3 / 60.8 / 79.4 | 10.4 / 39.3 / 56.1 | 15.0 / 38.7 / 63.1 |

Figure 8: Performance of state-of-the-art baseline visual odometry methods on known scenarios from the 4Seasons benchmark. The figure shows the translational error (in %), rotational error (in mdeg/m), and scale error (multiplier).

Figure 9: Performance of state-of-the-art visual odometry methods on unknown scenarios from the 4Seasons benchmark. The figure shows the translational error (in %), rotational error (in mdeg/m), and scale error (multiplier).

The provided VO benchmark is divided into two sets of evaluation sequences: unknown scenarios and known scenarios. Unknown scenarios are those for which no sequences at all are provided in the training set, namely Highway and Maximilianeum. Known scenarios are those for which sequences are also provided in the training set. While this distinction is irrelevant for purely geometric approaches, we believe that this separation will be important for evaluating the generalization capabilities of learning-based approaches. Table 2 shows the evaluation results on the individual sequences of the benchmark for known scenarios. Figure 8 shows the results across all sequences corresponding to the known scenarios in cumulative error plots. Table 3 shows the evaluation results on the individual sequences of the benchmark for unknown scenarios. Figure 9 shows the results across all sequences corresponding to the unknown scenarios in cumulative error plots.
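The relative metrics reported in the cumulative error plots (translational drift in %, rotational error in mdeg/m) are computed over path segments of fixed length. A simplified sketch of the translational part, assuming time-synchronized lists of (R, t) world-frame poses and not claiming to match the benchmark's exact protocol:

```python
import numpy as np

def relative_drift(poses_est, poses_ref, seg_len=100.0):
    """Average translational drift (%) over path segments of seg_len meters.

    poses_est, poses_ref: synchronized lists of (R, t) poses (R: 3x3, t: 3,).
    This is a simplified, KITTI-style relative-error sketch, not the official
    4Seasons evaluation code."""
    # Cumulative path length along the reference trajectory.
    dists = np.concatenate(([0.0], np.cumsum(
        [np.linalg.norm(poses_ref[k + 1][1] - poses_ref[k][1])
         for k in range(len(poses_ref) - 1)])))
    errors = []
    for i in range(len(poses_ref)):
        # First frame j at least seg_len meters further along the path.
        j = int(np.searchsorted(dists, dists[i] + seg_len))
        if j >= len(poses_ref):
            break
        Ri_e, ti_e = poses_est[i]
        _, tj_e = poses_est[j]
        Ri_r, ti_r = poses_ref[i]
        _, tj_r = poses_ref[j]
        # Segment motion expressed in the respective frame-i coordinates.
        d_est = Ri_e.T @ (tj_e - ti_e)
        d_ref = Ri_r.T @ (tj_r - ti_r)
        errors.append(100.0 * np.linalg.norm(d_est - d_ref)
                      / (dists[j] - dists[i]))
    return float(np.mean(errors)) if errors else None
```

For instance, a trajectory whose scale is consistently off by 1% yields a drift of roughly 1% under this metric.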

From Tables 2 and 3, one can observe that all evaluated methods perform significantly worse on the unknown scenarios. This is mainly due to their challenging conditions: on the one hand, highway sequences with high speeds and sudden lighting changes under bridges, and on the other hand, inner-city night sequences.

While the results above provide average numbers across all sequences of the benchmark, Figures 10 and 11 compare results side by side for identical scenarios under different conditions.

Figure 10 provides VO results on the Maximilianeum scenario in the afternoon (maximilianeum_1_test) and at night (maximilianeum_2_test), respectively. As one might expect, there is a significant drop in performance when going from day to night due to fewer visible landmarks. Nevertheless, it is interesting to observe that ORB-SLAM3 (with IMU) performs better at night than Basalt (with IMU). A reason might be that ORB-SLAM3 uses feature matching to find point correspondences, while Basalt relies on optical flow. This difference cannot be observed when running without IMU, where ORB-SLAM3 fails. In general, the task becomes considerably more difficult at night and without IMU.

Figure 11 provides performance comparisons between a sunny (office_loop_2_test) and a cloudy (office_loop_3_test) condition on the Office Loop scenario. Across all algorithms, one can observe improved performance under sunny weather conditions. A likely reason is the presence of more static feature points caused by shadows, especially on the road. This can be seen in the right-side images in Figure 11, where the road exhibits much more texture under sunny than under cloudy conditions.

While the evaluated methods show overall good performance under good weather and lighting conditions, we believe that our dataset and benchmark will contribute to improving performance in conditions with fewer and less reliable feature points. The results show that the proposed benchmark is highly challenging and still leaves room for improving state-of-the-art VO algorithms.

Figure 10: Comparison of visual odometry performance for afternoon and night. The figure shows the performance of different state-of-the-art baseline VO algorithms on the same route under afternoon and night conditions. One can observe a significant drop in performance when going from day to night due to fewer visible landmarks.

Figure 11: Comparison of visual odometry performance for sunny and cloudy conditions. The figure shows the performance of different state-of-the-art baseline VO algorithms on the same route under sunny and cloudy conditions. Across all algorithms, one can observe improved performance under sunny weather. A likely reason is the presence of more static feature points caused by shadows, especially on the road. This can be seen in the right-side images, where the road exhibits much more texture under sunny than under cloudy conditions.

6.2 Global Place Recognition

We evaluate current state-of-the-art baseline deep image descriptor methods, including NetVLAD (https://github.com/cvg/Hierarchical-Localization) (Arandjelovic et al, 2016) pretrained on Pittsburgh30k (Torii et al, 2013), Deep Image Retrieval (DIR, aka AP-GeM; https://github.com/naver/deep-image-retrieval) (Gordo et al, 2017; Revaud et al, 2019a) trained on the Landmarks dataset (Babenko et al, 2014), and CNN Image Retrieval (CIR; https://github.com/filipradenovic/cnnimageretrieval-pytorch) (Radenović et al, 2016, 2018) trained on the dataset derived from (Schonberger et al, 2015). For each scenario of the 4Seasons benchmark, we use one predefined recording as the reference map and another predefined recording as the query. Note that we leave out the Business Campus scenario for global place recognition.

As shown in Figure 12, we first plot two different recall curves: (1) Recall [%] vs. threshold [m] @ Top-1: the recall of each method as the distance threshold varies in the range 1 m–20 m, using only the top-1 retrieved image; and (2) Recall [%] vs. Top-N @ 1 m: the recall of each method as the number of candidate retrievals N ∈ {1, 2, 3, …, 20} varies, using a fixed range bound of 1 m. We also show the optimal recall obtained by using the database images closest to the ground-truth query image location as the candidates. This constitutes the upper bound of the global place recognition accuracy.
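The Recall @ Top-N curve can be computed directly from the retrieval ranking and the ground-truth positions; a minimal sketch (function and argument names are illustrative, not from the benchmark toolkit):

```python
import numpy as np

def recall_at_n(query_pos, db_pos, retrieved_idx, n=1, dist_thresh=1.0):
    """Fraction of queries for which at least one of the top-n retrieved
    database images lies within dist_thresh meters of the query position.

    query_pos: (Q, 3) query positions; db_pos: (D, 3) database positions;
    retrieved_idx: (Q, K) database indices ranked by descriptor similarity."""
    hits = 0
    for q, ranked in zip(query_pos, retrieved_idx):
        # Distances from the query to its top-n retrieved database images.
        d = np.linalg.norm(db_pos[ranked[:n]] - q, axis=1)
        if np.any(d <= dist_thresh):
            hits += 1
    return hits / len(query_pos)
```

Sweeping `n` from 1 to 20 at a fixed 1 m bound (or sweeping `dist_thresh` at n=1) reproduces the two curve families described above.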

From the results, we can see that NetVLAD still outperforms the other, more recent methods by a notable margin. One reason for NetVLAD's superior performance could be the inductive bias introduced into the network design, based on the established principles of classical VLAD (Jégou et al, 2010; Arandjelovic and Zisserman, 2013). However, one must also admit that the gap between the state-of-the-art methods and the optimal performance is still quite large, and more research on global place recognition is needed.

We show the localization accuracy of the global place recognition methods without using local pose refinement. Note that for these methods we use the top-20 candidates and loosened range bounds, namely (1 m, 5°) for high precision, (5 m, 10°) for medium precision, and (10 m, 20°) for coarse precision. The last three rows of Table 4 show the individual global place recognition (GPR) performance on each of the evaluated scenarios from the 4Seasons benchmark.
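Without pose refinement, a query can simply be assigned the pose of a retrieved database image and checked against the loosened bounds. The following is a simplified top-1 sketch of that evaluation (illustrative names and a top-1 restriction, not the official top-20 protocol):

```python
import numpy as np

def rotation_angle_deg(R_a, R_b):
    """Angle (deg) of the relative rotation between two rotation matrices."""
    cos_a = np.clip((np.trace(R_a.T @ R_b) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_a))

def gpr_success_rates(query_poses, db_poses, retrieved, bounds):
    """For each (t_max, r_max) bound, the fraction of queries whose top-1
    retrieved database pose lies within the bound of the query's reference
    pose. Poses are (R, t) tuples; `retrieved` maps query index -> db index."""
    rates = []
    for t_max, r_max in bounds:
        ok = 0
        for qi, di in enumerate(retrieved):
            Rq, tq = query_poses[qi]
            Rd, td = db_poses[di]
            if (np.linalg.norm(tq - td) <= t_max
                    and rotation_angle_deg(Rq, Rd) <= r_max):
                ok += 1
        rates.append(ok / len(query_poses))
    return rates
```

With the loosened bounds above, `bounds = [(1, 5), (5, 10), (10, 20)]` yields the high / medium / coarse GPR rates.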

Figure 12: Performance of state-of-the-art baseline global place recognition methods on the 4Seasons benchmark. The gray line indicates the upper bound for the global place recognition accuracy.

Figure 13: Performance of state-of-the-art baseline map-based visual localization approaches on the 4Seasons benchmark. The figure shows the cumulative localization accuracy against the translational and rotational error, respectively.

Table 4: Visual localization results on the 4Seasons benchmark. We report the percentage of images localized within (0.1 m, 1°), (0.25 m, 2°), and (1 m, 5°) of the reference poses for map-based visual localization (MBVL) pipelines. For global place recognition (GPR) methods, we report the percentage of images localized within (1 m, 5°), (5 m, 10°), and (10 m, 20°) of the reference poses. The best-performing results for both MBVL and GPR pipelines are in bold.

| | Method | Office Loop | Neighborhood | Business Campus | Countryside | City Loop | Old Town | Parking Garage | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MBVL | hloc (Sarlin et al, 2019) (SuperPoint (DeTone et al, 2018) + SuperGlue (Sarlin et al, 2020)) | 68.6 / 85.1 / 89.2 | 56.0 / 73.3 / 86.8 | 38.2 / 74.5 / 89.0 | 8.0 / 29.2 / 64.7 | 37.0 / 72.6 / 83.2 | 27.1 / 43.2 / 58.4 | 46.9 / 63.7 / 76.1 | 40.3 / 63.1 / 78.2 |
| MBVL | hloc (Sarlin et al, 2019) (D2-Net (Dusmanu et al, 2019) + NN) | 43.3 / 71.5 / 88.3 | 28.7 / 54.6 / 85.1 | 22.4 / 62.2 / 85.5 | 3.4 / 18.6 / 63.4 | 14.9 / 50.0 / 86.2 | 9.7 / 26.3 / 47.9 | 28.3 / 52.2 / 76.1 | 21.5 / 47.9 / 76.1 |
| MBVL | hloc (Sarlin et al, 2019) (SIFT (Lowe, 2004) + NN) | 24.6 / 39.3 / 52.6 | 33.6 / 50.6 / 66.1 | 3.5 / 9.9 / 18.9 | 0.1 / 0.6 / 2.4 | 9.0 / 22.3 / 40.3 | 0.0 / 0.3 / 4.0 | 9.7 / 16.8 / 27.4 | 11.5 / 20.0 / 30.2 |
| MBVL | hloc (Sarlin et al, 2019) (R2D2 (Revaud et al, 2019b) + NN) | 66.4 / 83.1 / 88.3 | 59.8 / 81.9 / 96.0 | 36.9 / 68.6 / 82.9 | 3.8 / 18.1 / 53.6 | 36.1 / 77.2 / 92.0 | 16.5 / 27.8 / 40.0 | 41.6 / 66.4 / 78.8 | 37.3 / 60.4 / 75.9 |
| GPR | NetVLAD (Arandjelovic et al, 2016) | 55.4 / 90.2 / 93.5 | 49.5 / 80.6 / 83.5 | – | 10.6 / 29.9 / 34.1 | 24.6 / 54.0 / 62.3 | 30.8 / 64.3 / 79.2 | 37.9 / 79.3 / 86.2 | 34.8 / 66.4 / 73.1 |
| GPR | CNN Image Retrieval (CIR) (Radenović et al, 2018) | 33.7 / 66.3 / 71.7 | 43.7 / 71.8 / 78.6 | – | 6.8 / 22.0 / 28.8 | 6.7 / 25.0 / 31.3 | 14.9 / 45.7 / 62.0 | 27.6 / 55.2 / 65.5 | 22.2 / 47.7 / 56.3 |
| GPR | Deep Image Retrieval (DIR) (Revaud et al, 2019a) | 26.1 / 58.7 / 68.5 | 41.7 / 72.8 / 79.6 | – | 10.2 / 23.9 / 29.5 | 7.9 / 24.2 / 30.6 | 19.0 / 43.0 / 63.3 | 27.6 / 79.3 / 89.7 | 22.1 / 50.3 / 60.2 |

6.3 Map-Based Visual Localization

For the evaluation of map-based visual localization, we use the following processing pipeline: we first build an SfM model of the reference scene that provides correspondences between local features and 3D points in the reconstructed map. Query image features are then matched against the database images, yielding 2D-3D matches. As the last step, those 2D-3D matches are used for camera pose estimation via Perspective-n-Point (PnP) and random sample consensus (RANSAC) (Fischler and Bolles, 1981). In particular, we evaluate the current state-of-the-art coarse-to-fine hierarchical localization method (Sarlin et al, 2019) based on the following learned deep local feature descriptors: SuperPoint (DeTone et al, 2018), D2-Net (Dusmanu et al, 2019), and R2D2 (Revaud et al, 2019b). Additionally, we use the classic scale-invariant feature transform (SIFT) (Lowe, 2004) algorithm. hloc (Sarlin et al, 2019) simultaneously predicts local features and global descriptors for accurate 6DoF localization. This approach first performs global place recognition to obtain location candidates and afterward matches the local features only within those places. The extracted local image features are used to establish 2D-3D matches within a pre-built SfM model. Pose estimation is performed using COLMAP (Schönberger and Frahm, 2016). This pipeline can therefore be seen as a pose refinement strategy. For each scenario of the 4Seasons benchmark, we use one predefined recording as the reference map and another predefined recording as the query.
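For illustration, the final PnP + RANSAC step can be sketched with a linear (DLT) pose solver inside a minimal RANSAC loop. In practice hloc relies on COLMAP's robust estimators, so the following numpy sketch is only a simplified stand-in, assuming exact 2D-3D matches, a calibrated pinhole camera, and illustrative function names:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N, 3) with pose (R, t) and intrinsics K (3, 3)."""
    x = (K @ (R @ X.T + t.reshape(3, 1))).T
    return x[:, :2] / x[:, 2:3]

def pnp_dlt(X, x, K):
    """Linear (DLT) pose from >= 6 2D-3D matches; x given in pixels."""
    # Normalize pixel coordinates with the intrinsics.
    xn = np.linalg.solve(K, np.concatenate([x, np.ones((len(x), 1))], axis=1).T).T
    A = []
    for Xw, (u, v, _) in zip(X, xn):
        A.append(np.concatenate([Xw, [1], np.zeros(4), -u * Xw, [-u]]))
        A.append(np.concatenate([np.zeros(4), Xw, [1], -v * Xw, [-v]]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)
    # Enforce a proper rotation via SVD and recover scale and sign.
    U, S, Vt2 = np.linalg.svd(P[:, :3])
    scale = S.mean()
    R = U @ Vt2
    if np.linalg.det(R) < 0:
        R, scale = -R, -scale
    return R, P[:, 3] / scale

def pnp_ransac(X, x, K, iters=200, thresh=2.0, seed=0):
    """Minimal RANSAC loop around the DLT solver (reprojection error in px)."""
    rng = np.random.default_rng(seed)
    best = (None, None, -1)
    for _ in range(iters):
        idx = rng.choice(len(X), 6, replace=False)
        try:
            R, t = pnp_dlt(X[idx], x[idx], K)
        except np.linalg.LinAlgError:
            continue  # Degenerate sample; try another.
        err = np.linalg.norm(project(K, R, t, X) - x, axis=1)
        n_inl = int(np.sum(err < thresh))
        if n_inl > best[2]:
            best = (R, t, n_inl)
    return best
```

A production pipeline would additionally refine the best pose on all inliers (e.g., with a non-linear least-squares solver), which COLMAP does internally.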

Figure 13 shows the percentage of correctly localized queries as the distance and orientation thresholds are varied, respectively. This figure shows the average performance of the different state-of-the-art map-based visual localization (MBVL) approaches across all evaluated scenarios. The first four rows of Table 4 show the individual hierarchical localization performance on each of the evaluated scenarios from the 4Seasons benchmark.

From the results, we can see that the classic SIFT + nearest neighbor (NN) approach performs poorly when estimating the 6DoF pose between the reference and query candidates. The results also show that deep learning-based algorithms dramatically outperform the classical method. This is due to the challenging nature of the benchmark, which features drastic lighting and illumination changes, occlusions, and changing environmental/weather conditions. These results provide valuable insights into the limitations and failure cases of the different methods. We observe that learned feature descriptors significantly outperform classic methods under the challenging conditions contained in the 4Seasons benchmark, with SuperPoint + SuperGlue yielding the best results overall. Nevertheless, the results in Table 4 show that the long-term localization problem is still far from solved, especially for highly dynamic environments (e.g., Old Town) and scenes that exhibit very similar structure (e.g., Countryside).

Our benchmark provides the basis to enable more research advances that are needed to close this performance gap.

7 Conclusion

Current benchmarks focus mainly on evaluating the performance of either simultaneous localization and mapping algorithms or visual localization in isolation. To close this gap, we introduce a benchmark that provides a holistic way of jointly benchmarking long-term visual SLAM and localization.

In this paper, we have introduced a comprehensive benchmark suite for visual SLAM and visual localization for autonomous driving under challenging conditions. The benchmark covers a huge variety of environmental conditions, along with short-term and long-term weather and illumination changes. Moreover, we have reviewed and evaluated the current state-of-the-art baseline approaches for visual SLAM and visual localization. We have observed large performance gaps and see huge potential in future work to close those gaps.

Figure 14: Dataset overview. Example images from our benchmark dataset. First row: office loop, second row: highway, third row: neighborhood, fourth row: business campus, fifth row: countryside, sixth row: city loop, seventh row: old town, eighth row: maximilianeum, ninth row: parking garage. The figure illustrates the large appearance changes, occlusions, seasonal, and structural changes present in the data.

Figure 15: Scenarios overview. This figure shows all the covered scenarios of our benchmark dataset. We provide vastly different environments in and around the city of Munich, Germany.

Table 5: Statistics of the 4Seasons dataset. This table shows the different scenarios and recordings along with the weather condition, seasons, and time of the day from our benchmark. We provide a variety of scenarios and short-term to long-term changes. These recordings are all released with ground truth (GNSS/IMU, point clouds, and reference poses) and can therefore be used for training learning-based techniques.

| Scenario | Recording | Weather (cloudy, rainy, snowy, sunny) | Season (winter, spring, summer, fall) | Daytime (morning, afternoon, evening, night) | Map Accuracy: Horizontal RMSE (GNSS-Ref. Pose) | Map Accuracy: % of Accurate Poses |
| --- | --- | --- | --- | --- | --- | --- |
| office_loop_1_train | 2020-03-24_17-36-22 | sunny | spring | afternoon | 6.84 cm | 85.93% |
| office_loop_2_train | 2020-03-24_17-45-31 | sunny | spring | afternoon | 6.34 cm | 86.92% |
| office_loop_3_train | 2020-04-07_10-20-32 | sunny | spring | morning | 5.44 cm | 77.72% |
| office_loop_4_train | 2020-06-12_10-10-57 | sunny | summer | morning | 2.74 cm | 54.01% |
| office_loop_5_train | 2021-01-07_12-04-03 | cloudy/snowy | winter | afternoon | 3.79 cm | 96.00% |
| office_loop_6_train | 2021-02-25_13-51-57 | sunny | winter | afternoon | 2.90 cm | 91.45% |
| neighborhood_1_train | 2020-03-26_13-32-55 | cloudy | spring | afternoon | 4.13 cm | 56.71% |
| neighborhood_2_train | 2020-10-07_14-47-51 | cloudy | fall | afternoon | 1.19 cm | 85.00% |
| neighborhood_3_train | 2020-10-07_14-53-52 | rainy | fall | afternoon | 2.00 cm | 84.12% |
| neighborhood_4_train | 2020-12-22_11-54-24 | cloudy | winter | morning | 3.47 cm | 87.92% |
| neighborhood_5_train | 2021-02-25_13-25-15 | sunny | winter | afternoon | 2.45 cm | 86.23% |
| neighborhood_6_train | 2021-05-10_18-02-12 | cloudy | spring | evening | 1.74 cm | 69.43% |
| neighborhood_7_train | 2021-05-10_18-32-32 | cloudy | spring | evening | 1.44 cm | 85.45% |
| business_campus_1_train | 2020-10-08_09-30-57 | sunny | fall | morning | 5.49 cm | 83.08% |
| business_campus_2_train | 2021-01-07_13-12-23 | cloudy/snowy | winter | afternoon | 1.77 cm | 99.13% |
| business_campus_3_train | 2021-02-25_14-16-43 | sunny | winter | afternoon | 7.33 cm | 66.86% |
| countryside_1_train | 2020-04-07_11-33-45 | sunny | spring | morning | 3.96 cm | 90.89% |
| countryside_2_train | 2020-06-12_11-26-43 | sunny | summer | morning | 2.54 cm | 87.00% |
| countryside_3_train | 2020-10-08_09-57-28 | sunny | fall | morning | 1.94 cm | 89.37% |
| countryside_4_train | 2021-01-07_13-30-07 | cloudy/snowy | winter | afternoon | 5.42 cm | 92.02% |
| city_loop_1_train | 2020-12-22_11-33-15 | rainy | winter | morning | 6.85 cm | 83.08% |
| city_loop_2_train | 2021-01-07_14-36-17 | snowy/sunny | winter | afternoon | 4.76 cm | 84.27% |
| city_loop_3_train | 2021-02-25_11-09-49 | sunny | winter | morning | 3.41 cm | 85.14% |
| old_town_1_train | 2020-10-08_11-53-41 | cloudy | fall | morning | 2.90 cm | 93.76% |
| old_town_2_train | 2021-01-07_10-49-45 | cloudy/snowy/sunny | winter | morning | 1.80 cm | 93.16% |
| old_town_3_train | 2021-02-25_12-34-08 | sunny | winter | afternoon | 1.43 cm | 83.12% |
| old_town_4_train | 2021-05-10_21-32-00 | cloudy | spring | night | 13.45 cm | 95.81% |
| parking_garage_1_train | 2020-12-22_12-04-35 | cloudy | winter | afternoon | 1.43 cm | 33.22% |
| parking_garage_2_train | 2021-02-25_13-39-06 | sunny | winter | afternoon | 2.52 cm | 40.54% |
| parking_garage_3_train | 2021-05-10_19-15-19 | cloudy | spring | evening | 3.41 cm | 34.15% |

Acknowledgments

We express our appreciation to our colleagues at Artisense for their help with setting up the recording setup and sensor design.

Declarations

Competing Interests

The authors declare that they have no conflict of interest.

References

  • Angeli et al (2008) Angeli A, Filliat D, Doncieux S, et al (2008) Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics (T-RO) 24(5):1027–1037
  • Arandjelovic and Zisserman (2013) Arandjelovic R, Zisserman A (2013) All about VLAD. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1578–1585
  • Arandjelovic et al (2016) Arandjelovic R, Gronat P, Torii A, et al (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 5297–5307
  • Babenko et al (2014) Babenko A, Slesarev A, Chigorin A, et al (2014) Neural codes for image retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 584–599
  • Badino et al (2011) Badino H, Huber D, Kanade T (2011) Visual topometric localization. In: Proceedings of the IEEE Intelligent Vehicles Symposium (IV), pp 794–799
  • Blanco-Claraco et al (2014) Blanco-Claraco JL, Ángel Moreno-Dueñas F, González-Jiménez J (2014) The Málaga urban dataset: High-rate stereo and LiDAR in a realistic urban scenario. International Journal of Robotics Research (IJRR) 33(2):207–214
  • Burri et al (2016) Burri M, Nikolic J, Gohl P, et al (2016) The EuRoC micro aerial vehicle datasets. International Journal of Robotics Research (IJRR) 35(10):1157–1163
  • Caesar et al (2020) Caesar H, Bankiti V, Lang AH, et al (2020) nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 11,621–11,631
  • Campos et al (2020) Campos C, Elvira R, Gómez JJ, et al (2020) ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. In: arXiv preprint arXiv:2007.11898
  • Cordts et al (2016) Cordts M, Omran M, Ramos S, et al (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3213–3223
  • DeTone et al (2018) DeTone D, Malisiewicz T, Rabinovich A (2018) SuperPoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 224–236
  • Dusmanu et al (2019) Dusmanu M, Rocco I, Pajdla T, et al (2019) D2-Net: A trainable CNN for joint detection and description of local features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 8092–8101
  • Engel et al (2014) Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-scale direct monocular SLAM. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 834–849
  • Engel et al (2015) Engel J, Stückler J, Cremers D (2015) Large-scale direct SLAM with stereo cameras. In: Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pp 1935–1942
  • Engel et al (2016) Engel J, Usenko V, Cremers D (2016) A photometrically calibrated benchmark for monocular visual odometry. In: arXiv preprint arXiv:1607.02555
  • Engel et al (2017) Engel J, Koltun V, Cremers D (2017) Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 40(3):611–625
  • Fischler and Bolles (1981) Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6):381–395
  • Gálvez-López and Tardos (2012) Gálvez-López D, Tardos JD (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics (T-RO) 28(5):1188–1197
  • Geiger et al (2012) Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3354–3361
  • Geiger et al (2013) Geiger A, Lenz P, Stiller C, et al (2013) Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR) 32(11):1231–1237
  • Gordo et al (2016) Gordo A, Almazán J, Revaud J, et al (2016) Deep image retrieval: Learning global representations for image search. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 241–257
  • Gordo et al (2017) Gordo A, Almazan J, Revaud J, et al (2017) End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision (IJCV) 124(2):237–254
  • Hartley et al (2013) Hartley R, Trumpf J, Dai Y, et al (2013) Rotation averaging. International Journal of Computer Vision (IJCV) 103(3):267–305
  • He et al (2016) He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
  • Hu and de Haan (2006) Hu H, de Haan G (2006) Low cost robust blur estimator. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), pp 617–620
  • Huang et al (2018) Huang X, Cheng X, Geng Q, et al (2018) The ApolloScape dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 954–960
  • Jafarzadeh et al (2021) Jafarzadeh A, Antequera ML, Gargallo P, et al (2021) Crowddriven: A new challenging dataset for outdoor visual localization. In: Proceedings of the International Conference on Computer Vision (ICCV), pp 9845–9855
  • Jaramillo (2017) Jaramillo C (2017) Direct multichannel tracking. In: Proceedings of the International Conference on 3D Vision (3DV), pp 347–355
  • Jégou et al (2010) Jégou H, Douze M, Schmid C, et al (2010) Aggregating local descriptors into a compact image representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3304–3311
  • Jung et al (2019) Jung E, Yang N, Cremers D (2019) Multi-frame GAN: Image enhancement for stereo visual odometry in low light. In: Conference on Robot Learning (CoRL), pp 651–660
  • Kannala and Brandt (2006) Kannala J, Brandt SS (2006) A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28(8):1335–1340
  • Kendall et al (2015) Kendall A, Grimes M, Cipolla R (2015) Posenet: A convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the International Conference on Computer Vision (ICCV), pp 2938–2946
  • Kenk and Hassaballah (2020) Kenk MA, Hassaballah M (2020) Dawn: Vehicle detection in adverse weather nature. In: arXiv preprint arXiv:2008.05402
  • Krizhevsky et al (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems (NIPS), pp 1097–1105
  • Kümmerle et al (2011) Kümmerle R, Grisetti G, Strasdat H, et al (2011) g2o: A general framework for graph optimization. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp 3607–3613
  • Lowe (2004) Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60(2):91–110
  • Lowry et al (2015) Lowry S, Sünderhauf N, Newman P, et al (2015) Visual place recognition: A survey. IEEE Transactions on Robotics (T-RO) 32(1):1–19
  • Maddern et al (2017) Maddern W, Pascoe G, Linegar C, et al (2017) 1 year, 1000 km: The oxford robotcar dataset. International Journal of Robotics Research (IJRR) 36(1):3–15
  • Mur-Artal and Tardós (2017) Mur-Artal R, Tardós JD (2017) ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics (T-RO) 33(5):1255–1262
  • Mur-Artal et al (2015) Mur-Artal R, Montiel JMM, Tardos JD (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics (T-RO) 31(5):1147–1163
  • Newcombe et al (2011) Newcombe RA, Lovegrove SJ, Davison AJ (2011) DTAM: dense tracking and mapping in real-time. In: Proceedings of the International Conference on Computer Vision (ICCV), pp 2320–2327
  • Pitropov et al (2021) Pitropov M, Garcia DE, Rebello J, et al (2021) Canadian adverse driving conditions datasett. International Journal of Robotics Research (IJRR) 40(4-5):681–690
  • Radenović et al (2016) Radenović F, Tolias G, Chum O (2016) CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–20
  • Radenović et al (2018) Radenović F, Tolias G, Chum O (2018) Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 41(7):1655–1668
  • Rehder et al (2016) Rehder J, Nikolic J, Schneider T, et al (2016) Extending kalibr: Calibrating the extrinsics of multiple IMUs and of individual axes. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp 4304–4311
  • Revaud et al (2019a) Revaud J, Almazan J, Rezende R, et al (2019a) Learning with average precision: Training image retrieval with a listwise loss. In: Proceedings of the International Conference on Computer Vision (ICCV), pp 5107–5116
  • Revaud et al (2019b) Revaud J, Weinzaepfel P, de Souza CR, et al (2019b) R2D2: repeatable and reliable detector and descriptor. In: Neural Information Processing Systems (NeurIPS), pp 12,405–12,415
  • Sarlin et al (2019) Sarlin PE, Cadena C, Siegwart R, et al (2019) From coarse to fine: Robust hierarchical localization at large scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • Sarlin et al (2020) Sarlin PE, DeTone D, Malisiewicz T, et al (2020) SuperGlue: Learning feature matching with graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • Sarlin et al (2022) Sarlin PE, Dusmanu M, Schönberger JL, et al (2022) Lamar: Benchmarking localization and mapping for augmented reality. In: Proceedings of the European Conference on Computer Vision (ECCV)
  • Sattler et al (2012) Sattler T, Weyand T, Leibe B, et al (2012) Image retrieval for image-based localization revisited. In: Proceedings of the British Machine Vision Conference (BMVC)
  • Sattler et al (2018) Sattler T, Maddern W, Toft C, et al (2018) Benchmarking 6DOF outdoor visual localization in changing conditions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 8601–8610
  • Schönberger and Frahm (2016) Schönberger JL, Frahm JM (2016) Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4104–4113
  • Schonberger et al (2015) Schonberger JL, Radenovic F, Chum O, et al (2015) From single image query to detailed 3D reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 5126–5134
  • Schubert et al (2018) Schubert D, Goll T, Demmel N, et al (2018) The TUM VI benchmark for evaluating visual-inertial odometry. In: Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pp 1680–1687
  • Simonyan and Zisserman (2015) Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (ICLR)
  • Spencer et al (2020) Spencer J, Bowden R, Hadfield S (2020) Same features, different day: Weakly supervised feature learning for seasonal invariance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 6459–6468
  • von Stumberg et al (2020) von Stumberg L, Wenzel P, Khan Q, et al (2020) GN-Net: The gauss-newton loss for multi-weather relocalization. IEEE Robotics and Automation Letters (RA-L) 5(2):890–897
  • Sturm et al (2012) Sturm J, Engelhard N, Endres F, et al (2012) A benchmark for the evaluation of RGB-D SLAM systems. In: Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pp 573–580
  • Szegedy et al (2015) Szegedy C, Liu W, Jia Y, et al (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1–9
  • Taira et al (2018) Taira H, Okutomi M, Sattler T, et al (2018) Inloc: Indoor visual localization with dense matching and view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 7199–7209
  • Tolias et al (2015) Tolias G, Sicre R, Jégou H (2015) Particular object retrieval with integral max-pooling of CNN activations. In: arXiv preprint arXiv:1511.05879
  • Torii et al (2013) Torii A, Sivic J, Pajdla T, et al (2013) Visual place recognition with repetitive structures. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 883–890
  • Torii et al (2015) Torii A, Arandjelovic R, Sivic J, et al (2015) 24/7 place recognition by view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1808–1817
  • Umeyama (1991) Umeyama S (1991) Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 13(4):376–380
  • Usenko et al (2019) Usenko V, Demmel N, Schubert D, et al (2019) Visual-inertial mapping with non-linear factor recovery. IEEE Robotics and Automation Letters (RA-L) 5(2):422–429
  • Valentin et al (2016) Valentin J, Dai A, Nießner M, et al (2016) Learning to navigate the energy landscape. In: Proceedings of the International Conference on 3D Vision (3DV), pp 323–332
  • Von Stumberg et al (2018) Von Stumberg L, Usenko V, Cremers D (2018) Direct sparse visual-inertial odometry using dynamic marginalization. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp 2510–2517
  • Wang et al (2017a) Wang R, Schwörer M, Cremers D (2017a) Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In: Proceedings of the International Conference on Computer Vision (ICCV), pp 3903–3911
  • Wang et al (2017b) Wang S, Bai M, Mattyus G, et al (2017b) TorontoCity: Seeing the world with a million eyes. In: Proceedings of the International Conference on Computer Vision (ICCV)
  • Warburg et al (2020) Warburg F, Hauberg S, Lopez-Antequera M, et al (2020) Mapillary street-level sequences: A dataset for lifelong place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2626–2635
  • Wenzel et al (2020) Wenzel P, Wang R, Yang N, et al (2020) 4Seasons: A cross-season dataset for multi-weather SLAM in autonomous driving. In: Proceedings of the German Conference on Pattern Recognition (GCPR)
  • Yang et al (2018) Yang N, Wang R, Stückler J, et al (2018) Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 817–833
