SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models
Abstract
Visual Foundation Models demonstrate limited spatial reasoning despite excelling at semantic image understanding, prompting the introduction of the SpaRRTa benchmark to evaluate their ability to recognize relative object positions.
Visual Foundation Models (VFMs), such as DINO and CLIP, excel at semantic understanding of images but exhibit weak spatial reasoning, which limits their applicability to embodied systems. As a result, recent work incorporates 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across other spatial tasks, raising the question of whether these models are truly spatially aware or merely overfit to specific 3D objectives. To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates the ability of VFMs to identify the relative positions of objects in an image. Unlike traditional 3D objectives that focus on precise metric prediction (e.g., surface normal estimation), SpaRRTa probes a fundamental capability underpinning more advanced forms of human-like spatial understanding. SpaRRTa can generate an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations. Evaluating a range of state-of-the-art VFMs, we reveal significant disparities among their spatial reasoning abilities. Through our analysis, we provide insights into the mechanisms that support or hinder spatial awareness in modern VFMs. We hope that SpaRRTa will serve as a useful tool for guiding the development of future spatially aware visual models.
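To make the evaluation setting concrete, the sketch below shows one common way such relative-position probing is done: freeze a VFM, pool its patch tokens inside two object regions, and train a linear head to classify their spatial relation. This is a minimal illustration, not SpaRRTa's released protocol; the label set `NUM_RELATIONS`, the box format (given in patch-grid units), and the random stand-in data are assumptions, and DINOv2 is used only as one example backbone.

```python
# Minimal sketch of a frozen-VFM linear probe for spatial relations.
# Assumptions (not from the paper): a 6-way relation label set, boxes
# expressed in patch-grid coordinates, and random stand-in images/labels.
import torch
import torch.nn as nn

NUM_RELATIONS = 6  # hypothetical: left/right/in-front/behind/above/below

# Frozen backbone: DINOv2 ViT-S/14 via torch hub (one real VFM option).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

probe = nn.Linear(2 * backbone.embed_dim, NUM_RELATIONS)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def relation_logits(images, boxes_a, boxes_b):
    """Mean-pool patch tokens inside each object's box, concatenate, classify."""
    with torch.no_grad():
        feats = backbone.forward_features(images)["x_norm_patchtokens"]
    B, N, D = feats.shape
    side = int(N ** 0.5)              # patch tokens form a side x side grid
    grid = feats.reshape(B, side, side, D)

    def pool(boxes):
        pooled = []
        for i, (x0, y0, x1, y1) in enumerate(boxes):
            region = grid[i, y0:y1, x0:x1].reshape(-1, D)
            pooled.append(region.mean(dim=0))
        return torch.stack(pooled)

    return probe(torch.cat([pool(boxes_a), pool(boxes_b)], dim=-1))

# Toy training step on stand-in data (224x224 input -> 16x16 patch grid).
images = torch.randn(4, 3, 224, 224)
boxes_a = [(0, 0, 4, 4)] * 4          # hypothetical object regions
boxes_b = [(8, 8, 12, 12)] * 4
labels = torch.randint(0, NUM_RELATIONS, (4,))

optimizer.zero_grad()
loss = loss_fn(relation_logits(images, boxes_a, boxes_b), labels)
loss.backward()
optimizer.step()
```

Probe accuracy on held-out scenes then serves as the spatial-awareness score for the backbone; swapping in another frozen VFM (e.g., CLIP's vision tower) while keeping the probe fixed in capacity is what makes such comparisons across models meaningful.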
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning (2026)
- SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning (2026)
- OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams (2026)
- Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding (2026)
- Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps (2026)
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026)
- SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning (2026)
Get this paper in your agent:
hf papers read 2601.11729
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash