Sunday, May 31, 2026

Beyond 2D: Unlocking Real-Time 3D Vision for Smarter AI Agents

Current vision systems struggle to understand moving objects in dynamic 3D environments. This research introduces GMOS, a framework that processes RGB video to provide 3D-aware, temporally fine-grained segmentation and tracking of multiple moving objects, with clear relevance to robotics, autonomous systems, and more.

Original paper: 2605.30352v1

Authors: Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

Key Takeaways

1. GMOS introduces a novel framework for Moving Object Segmentation (MOS) that operates directly on RGB video, providing 3D-aware and temporally fine-grained segmentation.
2. It overcomes fundamental limitations of prior methods, which rely on pre-computed 2D data (lacking 3D geometry) and treat motion as a sequence-level attribute.
3. GMOS enables the understanding of instantaneous motion states for multiple independent objects, crucial for dynamic real-world applications.
4. The research includes the new GMOS-2K dataset and MOS-I evaluation protocol, pushing the boundaries for training and assessing 3D-aware motion understanding.
5. GMOS achieves state-of-the-art results, runs significantly faster than previous multi-object MOS methods, and supports online inference for real-time deployment.

The Paper in 60 Seconds

Imagine an AI system that doesn't just see objects moving, but *understands* their instantaneous 3D motion and position, all from a standard video feed. That's the core promise of GMOS (Grounding Moving Object Segmentation in 3D Space and Time). This research tackles two major limitations in current Moving Object Segmentation (MOS): reliance on pre-computed, purely 2D data like optical flow (which lacks depth) and treating motion as a broad, sequence-level attribute rather than an instantaneous state for each object.

GMOS directly processes RGB video to deliver 3D-aware, temporally fine-grained segmentation of multiple moving objects. It introduces a new dataset, GMOS-2K, and an evaluation protocol, MOS-I, specifically designed for this richer understanding of motion. The result? State-of-the-art performance, significantly faster processing, and the capability for online inference – making it ready for real-world streaming deployment.

Why This Matters for Developers and AI Builders

In an increasingly dynamic world, your AI agents need more than just a flat, 2D understanding of movement. Whether you're building autonomous vehicles, sophisticated robotics, immersive AR/VR experiences, or advanced surveillance systems, the ability to accurately perceive and predict the 3D motion of objects in real-time is paramount.

Traditional approaches to Moving Object Segmentation (MOS) have hit a wall. They typically rely on auxiliary 2D data like optical flow – essentially, tracking pixels across frames. While useful, this approach has inherent limitations:

• Lack of 3D Geometry: Optical flow tells you *where a pixel moved on the 2D image plane*, but not *how far that object moved in 3D space* or its actual depth relative to the camera. This is critical for collision avoidance, precise manipulation, or realistic AR occlusion.

• Sequence-Level Motion: Most methods treat motion as a general property of a video sequence, rather than understanding the instantaneous motion state of *each individual object*. This means they might miss subtle, rapid changes in an object's trajectory, leading to delayed reactions or less accurate predictions.
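The depth ambiguity in the first point can be checked with a few lines of arithmetic. The sketch below is not taken from the paper: it assumes an ideal pinhole camera with a hypothetical focal length of 500 pixels, and the function names `project` and `image_flow` are illustrative only. It shows two objects with very different 3D motion producing the same 2D displacement on the image plane.

```python
def project(point, focal=500.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) in metres to pixels (u, v).

    The focal length of 500 px is an assumed value for illustration.
    """
    x, y, z = point
    return (focal * x / z, focal * y / z)


def image_flow(start, displacement, focal=500.0):
    """2D image-plane displacement caused by a 3D displacement of one point."""
    end = tuple(s + d for s, d in zip(start, displacement))
    u0, v0 = project(start, focal)
    u1, v1 = project(end, focal)
    return (u1 - u0, v1 - v0)


# Near object: 2 m from the camera, moves 0.1 m sideways.
near_flow = image_flow((0.0, 0.0, 2.0), (0.1, 0.0, 0.0))

# Far object: 10 m from the camera, moves 0.5 m sideways (five times as far).
far_flow = image_flow((0.0, 0.0, 10.0), (0.5, 0.0, 0.0))

# Both displacements come out at roughly 25 pixels to the right,
# so 2D flow alone cannot tell these two motions apart.
print("near:", near_flow)
print("far: ", far_flow)
```

Without depth, a flow-only method has no way to tell the slow nearby object from the fast distant one, which is the gap a 3D-aware approach aims to close.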

This is where GMOS comes in. By grounding MOS in 3D space and time, it gives developers a basis for building more robust and reactive AI systems. Imagine a system that doesn't just register that a car moves at some point in a clip, but knows whether that car is moving *right now* and where it sits in the 3D scene, which gives downstream trajectory prediction and planning a much better starting point. This is a step toward spatially and temporally aware AI rather than another refinement of 2D motion cues.

What GMOS Found: A Deeper Dive

The researchers behind GMOS identified and directly addressed the core limitations of existing MOS methods:

1. The 2D Data Dependency Problem: Current methods often start with pre-computed 2D optical flow or point trajectories. While these inputs provide motion cues, they fundamentally lack 3D geometric information. This means the AI is inferring 3D motion from inherently 2D data, which is like trying to understand a 3D sculpture by only looking at its shadow. GMOS bypasses this by operating directly on RGB video, learning to infer 3D motion and segmentation simultaneously. This direct approach allows the model to build a richer, intrinsically 3D representation of the scene and its moving components.

2. The Sequence-Level Motion Problem: Many MOS methods treat motion as a general attribute of a video segment. This overlooks the crucial aspect of instantaneous motion state for *each individual object*. An object's motion can change rapidly; an AI needs to understand these immediate shifts to react effectively. GMOS is designed to be temporally fine-grained, meaning it can pinpoint the motion of objects at a much finer temporal resolution, capturing those critical instantaneous changes.

To achieve this, GMOS proposes a novel framework that:

• Operates on RGB Video: No need for external, pre-computed 2D motion cues. The model learns everything it needs directly from the raw video frames.

• Produces 3D-aware Segmentations: The output isn't just a 2D mask; it's a segmentation that implicitly understands the object's position and movement within a 3D scene. This means better understanding of depth, occlusion, and spatial relationships.

• Supports Multiple Moving Objects: GMOS can simultaneously segment and track the instantaneous motion of several independent objects within the same scene.

• Offers GMOS-S for Speed: For scenarios requiring faster deployment or where a simpler foreground-background distinction is sufficient, GMOS-S provides a streamlined variant.
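The difference between a sequence-level motion label and an instantaneous motion state for several objects can be made concrete with a small data-structure sketch. Everything here is hypothetical: the object names, the per-frame flags, and the helper functions are invented for illustration and do not reflect the GMOS output format or the GMOS-2K annotation schema.

```python
from typing import Dict, List

# Hypothetical per-frame motion flags for three objects over five frames.
# True means the object is moving in that frame.
frame_states: Dict[str, List[bool]] = {
    "car":        [False, False, True, True, True],      # parked, then pulls away
    "pedestrian": [True, True, True, False, False],      # walks, then stops
    "bench":      [False, False, False, False, False],   # never moves
}


def sequence_level_label(states: List[bool]) -> bool:
    """Sequence-level view: one flag for the whole clip (moved at any point?)."""
    return any(states)


def moving_objects_at(states_by_object: Dict[str, List[bool]], t: int) -> List[str]:
    """Instantaneous view: which objects are moving at frame t."""
    return sorted(name for name, states in states_by_object.items() if states[t])


def motion_transitions(states: List[bool]) -> List[int]:
    """Frame indices at which an object starts or stops moving."""
    return [t for t in range(1, len(states)) if states[t] != states[t - 1]]


# The sequence-level view labels both car and pedestrian as simply "moving".
clip_labels = {name: sequence_level_label(s) for name, s in frame_states.items()}

# The instantaneous view separates them frame by frame.
first_frame = moving_objects_at(frame_states, 0)
last_frame = moving_objects_at(frame_states, 4)
car_changes = motion_transitions(frame_states["car"])

print(clip_labels, first_frame, last_frame, car_changes)
```

In this toy clip the sequence-level labels cannot distinguish the car from the pedestrian, while the per-frame view shows that only the pedestrian moves at the start and only the car moves at the end.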

Enabling Training and Evaluation:

To support this new paradigm, the team also curated GMOS-2K, a significant dataset comprising 2,210 real-world videos. This dataset is unique because it includes per-object temporal motion annotations sourced from five established Video Object Segmentation (VOS) benchmarks. This rich annotation allows for training models to understand true 3D motion over time.

Furthermore, they formalized MOS-I (MOS for "Instantaneous"), a new evaluation protocol with three complementary metrics designed to specifically assess the temporally fine-grained understanding of object motion. This ensures that models are evaluated not just on overall segmentation accuracy, but on their ability to capture those critical instantaneous movements.
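To see why frame-level scoring matters, here is a minimal sketch of one way a temporally fine-grained score could be computed. This is not the MOS-I definition: the three MOS-I metrics are not spelled out in this summary, and the function `instantaneous_iou`, the set-of-pixels mask representation, and the rule that a stationary object has an empty target mask are all assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection-over-union of two pixel sets; two empty sets count as a match."""
    if not pred and not target:
        return 1.0
    return len(pred & target) / len(pred | target)


def instantaneous_iou(pred_masks, object_masks, is_moving):
    """Mean per-frame IoU against a motion-gated target (an assumed scoring rule).

    In frames where the object is annotated as moving, the target is its mask.
    In frames where it is stationary, the target is empty, so predicting the
    object there is penalised.
    """
    scores = []
    for pred, mask, moving in zip(pred_masks, object_masks, is_moving):
        target = mask if moving else set()
        scores.append(iou(pred, target))
    return sum(scores) / len(scores)


# One object occupying the same two pixels for four frames,
# stationary in the first two frames and moving in the last two.
mask = {(0, 0), (0, 1)}
object_masks = [mask, mask, mask, mask]
is_moving = [False, False, True, True]

# A sequence-level predictor segments the object in every frame.
sequence_style = [mask, mask, mask, mask]
# A fine-grained predictor segments it only while it is moving.
fine_grained = [set(), set(), mask, mask]

score_sequence = instantaneous_iou(sequence_style, object_masks, is_moving)
score_fine = instantaneous_iou(fine_grained, object_masks, is_moving)
print(score_sequence, score_fine)
```

Under this assumed rule, the predictor that ignores when the object is moving loses half its score even though every mask it produces is spatially perfect.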

Performance and Practicality:

GMOS doesn't just introduce a theoretical improvement; it delivers state-of-the-art results across standard MOS, the new MOS-I, and Unsupervised VOS benchmarks. Crucially for developers, it runs significantly faster than prior multi-object MOS methods and, perhaps most importantly, supports online inference. This means GMOS can process video streams in real-time, making it viable for live deployment in applications like autonomous systems and robotics.
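Online inference means each frame is processed as it arrives, using only the current and past frames. The loop below sketches that pattern under stated assumptions: `stream_segment` and its `context` parameter are hypothetical, and `frame_difference_model` is a toy stand-in based on simple frame differencing, not GMOS and not its API.

```python
from collections import deque


def stream_segment(frames, model, context=4):
    """Yield (frame_index, result) for each incoming frame.

    Only the current frame and up to `context - 1` previous frames are kept,
    so memory stays bounded and no future frames are required.
    """
    buffer = deque(maxlen=context)
    for index, frame in enumerate(frames):
        buffer.append(frame)
        yield index, model(list(buffer))


def frame_difference_model(window):
    """Toy stand-in model (NOT GMOS): pixels that changed since the previous frame."""
    if len(window) < 2:
        return set()
    previous, current = window[-2], window[-1]
    return {
        (r, c)
        for r, row in enumerate(current)
        for c, value in enumerate(row)
        if value != previous[r][c]
    }


# Three tiny 1x3 grayscale frames: a bright pixel shifts right, then stays put.
frames = [
    [[9, 0, 0]],
    [[0, 9, 0]],
    [[0, 9, 0]],
]

results = list(stream_segment(frames, frame_difference_model))
for index, moving_pixels in results:
    print(index, sorted(moving_pixels))
```

Because `stream_segment` is a generator, it accepts any iterable of frames, including a live camera feed, and emits a result as soon as each frame arrives. An offline method, by contrast, would need the whole clip before producing anything.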

How You Can Build with GMOS: Cross-Industry Applications

The implications of GMOS are far-reaching, enabling developers to create more intelligent and reliable AI solutions across diverse industries:

• Robotics & Industrial Automation: Imagine a robotic arm on an assembly line that needs to pick a specific component from a continuously moving conveyor belt, while also avoiding unexpected human intervention. GMOS allows robots to precisely track the instantaneous 3D position and trajectory of objects and people, enhancing both efficiency and safety. This means more flexible manufacturing lines, safer human-robot collaboration, and more precise manipulation in dynamic environments.

• Autonomous Systems (Vehicles, Drones, Shipping): For self-driving cars, delivery drones, or autonomous forklifts in a warehouse, understanding the precise 3D motion of pedestrians, other vehicles, and obstacles is the difference between safe operation and an accident. GMOS provides the enhanced perception layer needed to accurately predict the immediate future trajectories of surrounding agents, leading to more robust collision avoidance and smarter path planning.

• Augmented Reality (AR) & Virtual Reality (VR): In AR, virtual objects need to interact realistically with the real world. If a real person walks in front of a virtual character, the character should be occluded correctly. GMOS enables dynamic, real-time 3D scene understanding, allowing for highly realistic occlusion effects and more immersive interactions where virtual elements seamlessly integrate with moving physical objects. This can revolutionize gaming, training simulations, and interactive experiences.

• Smart City & Public Safety: In a smart city context, GMOS can power advanced surveillance systems. Instead of just detecting "motion," it can identify and track specific individuals or objects, understand their instantaneous 3D movement patterns, and flag anomalies with higher precision. This leads to proactive incident response, optimized traffic flow management, and more effective public safety measures, reducing false positives and improving situational awareness.

• Sports Analytics & Biomechanics: Imagine a system that can track every player and the ball in a complex sports environment, not just in 2D, but with full 3D positional and instantaneous velocity data. GMOS could unlock unprecedented insights into player movement, tactical analysis, and biomechanical performance, revolutionizing training, game strategy, and injury prevention.

The Future is 3D and Instantaneous

GMOS represents a significant step forward in computer vision, moving beyond the limitations of 2D approximations to embrace a true 3D understanding of dynamic scenes. For developers, this means the tools are now available to build AI agents that perceive and react to the world with a level of sophistication previously out of reach. It's time to think in 3D, and build for instantaneous action.

Cross-Industry Applications

• Robotics & Industrial Automation: Real-time, precise object manipulation on conveyor belts or in dynamic workspaces, enabling safer human-robot collaboration. This improves the safety, efficiency, and flexibility of automated manufacturing and logistics operations.

• Autonomous Systems: Enhanced perception and prediction of surrounding moving objects (pedestrians, other vehicles, obstacles) for collision avoidance and path planning in self-driving cars and drones. This increases the reliability and safety of autonomous vehicles and drone delivery systems in complex, real-world environments.

• Augmented Reality (AR) & Virtual Reality (VR): More realistic real-time occlusion of virtual objects by real-world moving objects, improving dynamic scene understanding for interactive AR experiences. This creates more immersive and believable AR/VR experiences, blurring the lines between digital and physical interactions.

• Smart City & Public Safety: Advanced anomaly detection and precise tracking of specific individuals or objects in crowded public spaces, enabling smart traffic management and proactive incident response. This boosts public safety, supports urban planning, and improves situational awareness with fewer false positives in surveillance systems.

• Sports Analytics & Biomechanics: Fine-grained 3D tracking of players and objects (e.g., the ball) in sports, capturing instantaneous velocities and trajectories for performance analysis and tactical insights. This supports training, game strategy, and injury prevention with detailed data on movement dynamics.
