PureSpace: A Benchmark for Abstract Spatial Reasoning in Vision-Language Models

Abstract

Spatial reasoning remains a persistent challenge for Vision-Language Models (VLMs). To this end, we introduce PureSpace, a benchmark built on abstract geometric objects that isolates three core tasks: rotation, projection, and completion. Our experiments reveal that state-of-the-art models achieve only modest performance, and that their accuracy shows no clear relationship with task difficulty, suggesting a lack of genuine spatial understanding. Furthermore, we find that while specialized models can excel at a single task, they fail to generalize and drop to near-random accuracy on unseen tasks. To overcome these shortcomings, we propose a cognitively inspired framework that decomposes the problem: a perception module represents the geometric structure, a language model infers the viewpoint transformation, and a renderer synthesizes the target-view appearance; a VLM then uses these intermediate outputs to determine the correct answer. Experiments show that our method achieves substantial improvements on all three tasks and provides enhanced interpretability and robustness.
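
The decomposition described above can be illustrated with a toy sketch. This is not the paper's implementation: it assumes the perception module has already produced a voxel set and the language model has already inferred a rotation in quarter turns, and it swaps the final VLM decision for a simple intersection-over-union match against candidate views. All names (`rotate_z`, `render_top_view`, `pick_answer`) are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Voxel = Tuple[int, int, int]
Shape = FrozenSet[Voxel]             # stand-in for the perception module's output
View = FrozenSet[Tuple[int, int]]    # occupied cells of a 2D orthographic view


def rotate_z(shape: Shape, quarter_turns: int) -> Shape:
    """Rotate voxels about the z axis by quarter_turns * 90 degrees."""
    out = set(shape)
    for _ in range(quarter_turns % 4):
        out = {(-y, x, z) for (x, y, z) in out}
    return frozenset(out)


def render_top_view(shape: Shape) -> View:
    """Project along z, then shift so the view's minimum corner is (0, 0)."""
    cells = {(x, y) for (x, y, _) in shape}
    min_x = min(x for x, _ in cells)
    min_y = min(y for _, y in cells)
    return frozenset((x - min_x, y - min_y) for x, y in cells)


def pick_answer(shape: Shape, quarter_turns: int, candidates: List[View]) -> int:
    """Return the index of the candidate view closest to the rendered target.

    Assumption: IoU matching replaces the VLM's final selection step,
    and the shape and all candidates are non-empty.
    """
    target = render_top_view(rotate_z(shape, quarter_turns))
    scores = [len(target & c) / len(target | c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy example: an L-shaped object rotated by one quarter turn.
l_shape: Shape = frozenset({(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0)})
candidates: List[View] = [
    render_top_view(l_shape),               # distractor: unrotated view
    render_top_view(rotate_z(l_shape, 1)),  # correct: rotated view
]
choice = pick_answer(l_shape, 1, candidates)
```

Keeping the geometry, the transformation, and the rendered view as explicit intermediate values is what makes each stage inspectable in a design of this kind; the sketch only covers a rotation followed by a projection, not the completion task.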