arxiv:2510.26909

NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Published on Oct 30, 2025 · Submitted by Moritz Reuss on Nov 4, 2025
Abstract

Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark in which a model receives an instruction and an embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.

AI-generated summary

NaviTrace is a Visual Question Answering benchmark for evaluating robotic navigation capabilities using a semantic-aware trace score across various scenarios and embodiment types.
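
As a rough illustration of how such a trace score can be assembled, here is a minimal Python sketch, not the authors' implementation: the weights, the length normalization, and the per-class penalty table are all assumptions made for the example.

import numpy as np

def dtw_distance(pred, gt):
    # Dynamic Time Warping distance between two 2D traces (N x 2 and M x 2
    # arrays of pixel coordinates), normalized by the longer trace length.
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    n, m = len(pred), len(gt)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / max(n, m)

def trace_score(pred, gt, semantics, penalty_table, embodiment,
                weights=(1.0, 1.0, 1.0)):
    # Combine DTW distance, goal endpoint error, and an embodiment-conditioned
    # semantic penalty into one cost (lower is better). penalty_table is a
    # hypothetical dict mapping embodiment -> {semantic class id: cost}, so
    # e.g. a wheeled robot can be charged more than a human for "stairs".
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    d_dtw = dtw_distance(pred, gt)
    d_goal = np.linalg.norm(pred[-1] - gt[-1])
    pens = [penalty_table[embodiment].get(int(semantics[int(y), int(x)]), 0.0)
            for x, y in pred]
    d_sem = float(np.mean(pens))
    w_dtw, w_goal, w_sem = weights
    return w_dtw * d_dtw + w_goal * d_goal + w_sem * d_sem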

Community


NaviTrace is a novel VQA benchmark for evaluating vision–language models (VLMs) on their embodiment-specific understanding of navigation across diverse real-world scenarios. Given a natural-language instruction and an embodiment type (human, legged robot, wheeled robot, bicycle), a model must output a 2D navigation path in image space, which we call a trace.
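
Concretely, a trace is just an ordered sequence of pixel coordinates in the input image. A hypothetical parsed model output might look like the following (the coordinate values are made up for illustration):

# Ordered (x, y) pixel coordinates from the start position toward the goal.
trace = [(512, 940), (520, 810), (560, 690), (605, 560), (630, 450)]
embodiment = "wheeled robot"  # one of: human, legged robot, wheeled robot, bicycle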

The benchmark includes 1000 scenarios with 3000+ expert traces, divided into:

  • Validation split (50%) for experimentation and model fine-tuning.
  • Test split (50%) with hidden ground truths for a fair leaderboard evaluation.

The dataset is available on Hugging Face. We provide ready-to-use evaluation scripts for API-based model inference and scoring, along with a leaderboard where users can compute scores on the test split and optionally submit their models for public comparison.
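
For example, pulling the validation split with the datasets library might look like the sketch below; the dataset id and field names are assumptions, so check the project page for the actual ones.

from datasets import load_dataset

# Hypothetical dataset id; see the project page for the real one.
ds = load_dataset("leggedrobotics/NaviTrace", split="validation")
sample = ds[0]
# Expected fields (assumed): an image, a natural-language instruction,
# an embodiment type, and one or more expert traces.
print(sample.keys())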


Get this paper in your agent:

hf papers read 2510.26909

If you don't have the latest Hugging Face CLI, install it first:

curl -LsSf https://hf.co/cli/install.sh | bash
