OccAny: The AI That Sees the World in 3D, Anywhere, Anytime
Imagine AI agents that can understand complex 3D environments from any camera, without needing specialized sensors or meticulous calibration. OccAny is a groundbreaking model that promises to unlock generalized 3D occupancy prediction, allowing developers to build sophisticated spatial AI applications like never before.
Original paper: 2603.23502v1

Key Takeaways
1. OccAny is the first generalized 3D occupancy model that works on unconstrained, out-of-domain, and uncalibrated urban scenes.
2. It provides metric 3D occupancy and segmentation features from diverse inputs (monocular, sequential, surround-view images).
3. Key innovations include 'Segmentation Forcing' for improved quality and 'Novel View Rendering' for geometry completion.
4. OccAny significantly reduces reliance on expensive sensors and extensive calibration, making 3D perception more accessible.
5. The model opens up new possibilities for AI agents in robotics, AR/VR, smart cities, and autonomous systems by providing robust, flexible 3D environmental understanding.
Why Generalized 3D Perception Matters for Developers
As AI builders, we're constantly pushing the boundaries of what intelligent systems can perceive and interact with. In the realm of computer vision, 3D scene understanding is the holy grail. Whether you're developing autonomous vehicles, sophisticated robotics, immersive AR/VR experiences, or even advanced simulations, a robust understanding of the physical world in three dimensions is paramount.
However, traditional methods for 3D occupancy prediction—knowing what space is occupied by objects—have been notoriously rigid. They often demand highly calibrated sensors (think expensive LiDAR arrays), rely on vast amounts of in-domain annotated data, and struggle to generalize to new, uncalibrated environments. This creates significant barriers to entry and limits the scalability of our AI solutions.
Enter OccAny. This new research from Anh-Quan Cao and Tuan-Hung Vu presents a paradigm shift: a generalized, unconstrained urban 3D occupancy model. For developers, this means the potential to equip AI agents with universal 3D perception, vastly expanding the scope and flexibility of what we can build.
The Paper in 60 Seconds: OccAny's Core Breakthroughs
OccAny: Generalized Unconstrained Urban 3D Occupancy tackles the limitations of current 3D vision systems. Instead of needing specific sensors or calibration, OccAny can predict and complete metric 3D occupancy (a precise understanding of occupied space) from diverse inputs: single images, sequential video frames, or surround-view camera setups. It even works with out-of-domain and uncalibrated scenes, which is a massive leap forward.
The key innovations are:
- Segmentation Forcing, which injects semantic segmentation directly into the 3D occupancy prediction to sharpen object boundaries and overall quality.
- Novel View Rendering, which completes occluded or missing geometry by inferring it from synthetic viewpoints.
In essence, OccAny allows AI to 'see' and 'understand' the 3D structure of urban environments with unprecedented flexibility and accuracy, paving the way for truly adaptive intelligent systems.
The Problem with Current 3D Vision: A Developer's Frustration
Imagine you're building an autonomous robot for a new factory floor. Current 3D perception models would likely demand:
- Precisely calibrated, often expensive depth sensors such as LiDAR arrays.
- Large volumes of in-domain annotated training data for that specific environment.
- Re-calibration or retraining whenever the cameras move or the environment changes.
This makes rapid deployment, cost-effectiveness, and real-world adaptability incredibly challenging. The dream of AI agents seamlessly navigating any environment has been hampered by these practical constraints.
Enter OccAny: Universal 3D Perception for Every Camera
OccAny directly addresses these pain points, offering a solution that's both powerful and remarkably versatile.
Breaking Free from Calibration and Domain Constraints
The most significant contribution of OccAny is its ability to operate on out-of-domain uncalibrated scenes. This means you don't need to know the precise angles or positions of your cameras, nor do your cameras need to be high-end depth sensors. Standard monocular (single) cameras, common in surveillance systems, smartphones, or dashcams, can now be leveraged for sophisticated 3D understanding.
Metric Occupancy with Semantic Richness
OccAny doesn't just tell you *something* is there; it provides metric occupancy. This means it understands the actual size, shape, and precise location of objects in 3D space. Coupled with segmentation features, it can also identify *what* those objects are (e.g., a car, a pedestrian, a building). This combination is crucial for intelligent decision-making, allowing an AI agent to not only avoid an obstacle but also understand if it's a static wall or a moving person.
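To make "metric occupancy with semantics" concrete, here is a minimal sketch of a semantic occupancy grid. This is illustrative only, not OccAny's actual output format: each voxel stores a class id at a known real-world scale, so the model's predictions support both geometric queries (how big, where) and semantic ones (what).

```python
import numpy as np

# Illustrative semantic occupancy grid (NOT OccAny's actual data format).
# Each voxel holds a class id: 0 = free space, other ids = semantic classes.
VOXEL_SIZE = 0.5  # metres per voxel side -- "metric" means real-world scale
CLASSES = {0: "free", 1: "car", 2: "pedestrian", 3: "building"}

grid = np.zeros((40, 40, 8), dtype=np.uint8)  # a 20m x 20m x 4m volume
grid[10:14, 10:18, 0:3] = 1                   # a car-sized occupied block

def occupied_volume_m3(grid: np.ndarray, class_id: int) -> float:
    """Real-world volume occupied by a given class, in cubic metres."""
    return float(np.count_nonzero(grid == class_id)) * VOXEL_SIZE ** 3

print(occupied_volume_m3(grid, 1))  # 4*8*3 voxels * 0.125 m^3 = 12.0
```

Because the grid is metric, the same structure answers both "is this cell blocked?" and "is the blocker a parked car or a wall?", which is exactly the distinction the paragraph above describes.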
The Power of Segmentation Forcing
One of OccAny's clever tricks is Segmentation Forcing. This technique integrates semantic segmentation (identifying objects pixel by pixel) directly into the 3D occupancy prediction process. By forcing the model to also understand the semantic boundaries of objects, it significantly improves the accuracy and completeness of the 3D occupancy map. Think of it as giving the model a stronger hint about where objects begin and end, leading to cleaner, more precise 3D reconstructions.
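One plausible way to read "forcing" is as an auxiliary objective: the network is trained on occupancy *and* segmentation at once, so its shared features must respect object boundaries. The sketch below is a hedged guess at that idea in its simplest form (the paper's actual formulation may differ substantially); `joint_loss` and its weighting are assumptions for illustration.

```python
import numpy as np

def binary_ce(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean binary cross-entropy (e.g., occupied vs. free)."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def joint_loss(occ_pred, occ_gt, seg_pred, seg_gt, seg_weight=0.5):
    """Occupancy loss 'forced' with an auxiliary segmentation term.

    The segmentation term penalizes features that blur object boundaries,
    which in turn cleans up the 3D occupancy prediction.
    """
    return binary_ce(occ_pred, occ_gt) + seg_weight * binary_ce(seg_pred, seg_gt)

occ_pred = np.array([0.9, 0.1, 0.8]); occ_gt = np.array([1.0, 0.0, 1.0])
seg_pred = np.array([0.7, 0.2]);      seg_gt = np.array([1.0, 0.0])
print(joint_loss(occ_pred, occ_gt, seg_pred, seg_gt))
```

The design point is that the segmentation term only *adds* gradient signal; with `seg_weight=0` you recover plain occupancy training, so the "hint about where objects begin and end" is strictly extra supervision.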
Completing the Picture with Novel View Rendering
Real-world scenes are messy. Objects can be partially occluded, and a single view might not capture all geometry. OccAny's Novel View Rendering pipeline addresses this by inferring geometry from synthetic novel viewpoints. During inference, if the model struggles with a particular area, it can 'imagine' what that area would look like from a different angle to complete the missing 3D information. This is like having an intelligent fill-in-the-blanks mechanism for 3D space, making its predictions more robust and comprehensive, especially in cluttered urban environments.
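The fill-in-the-blanks intuition can be sketched as multi-view fusion: each viewpoint only observes part of the scene, and combining estimates from extra (here, pretend-synthesized) viewpoints recovers regions any single view missed. This toy example is not OccAny's rendering pipeline; the `observe` camera and the max-fusion rule are illustrative assumptions.

```python
import numpy as np

# Toy 2D "scene": a ground-truth occupied block in an 8x8 grid.
full_scene = np.zeros((8, 8), dtype=float)
full_scene[2:6, 2:6] = 1.0

def observe(scene: np.ndarray, visible_cols: slice) -> np.ndarray:
    """Pretend camera: sees only some columns; the rest stays unknown (0)."""
    est = np.zeros_like(scene)
    est[:, visible_cols] = scene[:, visible_cols]
    return est

view_a = observe(full_scene, slice(0, 4))  # original viewpoint: left half
view_b = observe(full_scene, slice(4, 8))  # synthetic novel viewpoint: right half
fused = np.maximum(view_a, view_b)         # per-cell fusion of the two estimates

# view_a alone recovers 8 of 16 occupied cells; fusing both recovers all 16.
print(int(view_a.sum()), int(fused.sum()))
```

In a real pipeline the second view would be rendered by the model rather than taken from ground truth, but the completion mechanism (fuse what each viewpoint can see) is the same.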
Versatility Across Input Settings
Whether you have a single camera, a video stream, or a full 360-degree surround-view setup, OccAny is designed to work. This input flexibility makes it incredibly adaptable to a wide range of real-world deployment scenarios, from a drone's single forward-facing camera to a smart city's network of varied surveillance feeds.
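From a developer's perspective, that flexibility suggests a single entry point that dispatches on the input setting. The wrapper below is hypothetical: `Frame`, `infer_setting`, and the thresholds are NOT the real OccAny API (check the GitHub repo for the actual interface), but they illustrate the monocular / sequential / surround-view distinction.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """Hypothetical container for one raw image from any uncalibrated camera."""
    pixels: bytes

def infer_setting(frames: List[Frame]) -> str:
    """Illustrative dispatch on input setting: one image, a short clip,
    or a surround-view rig. Thresholds here are arbitrary for the sketch."""
    if len(frames) == 1:
        return "monocular"
    if len(frames) <= 4:
        return "sequential"
    return "surround-view"

print(infer_setting([Frame(b"")]))      # a dashcam still
print(infer_setting([Frame(b"")] * 6))  # a six-camera surround rig
```

The point is that no branch requires intrinsics, extrinsics, or depth sensors as arguments; the same visual feed that exists today is the whole input.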
What Can You Build with OccAny? Practical Applications for Developers
OccAny isn't just an academic achievement; it's a powerful tool that opens up new possibilities for developers and AI builders across industries.
Cross-Industry Applications
Robotics & Logistics
Use case: Autonomous warehouse robots navigating dynamic, unstructured environments using standard security cameras, or delivery drones mapping unknown terrain on the fly.
Impact: Reduces reliance on costly LiDAR and specialized sensors; increases operational flexibility and deployment speed in diverse settings.
Augmented Reality (AR) / Metaverse
Use case: Real-time 3D reconstruction of a user's physical environment from their phone camera, allowing seamless blending of digital objects (e.g., trying on virtual furniture, interactive games in real spaces).
Impact: Enables more immersive, realistic, and context-aware AR experiences without requiring specialized depth sensors, broadening AR accessibility.
DevTools / AI Agent Orchestration
Use case: Providing AI agents (e.g., for facility management, smart city operations, or even game NPCs) with a robust, generalized 3D perception layer derived from existing visual feeds, allowing them to understand and interact with complex, real-world environments.
Impact: Empowers AI agents to operate more autonomously and intelligently in diverse physical or simulated spaces, reducing human oversight and improving decision-making based on dynamic 3D context.
Smart City / Urban Planning
Use case: Automatically generating and updating high-fidelity 3D digital twins of urban areas or industrial facilities using existing surveillance cameras or dashcam footage, without manual calibration.
Impact: Streamlines infrastructure monitoring, urban development simulations, and maintenance planning with up-to-date, accurate 3D models, enabling better resource allocation and predictive analytics.
Conclusion
OccAny represents a significant leap towards truly generalized 3D perception. By freeing developers from the constraints of specialized sensors and domain-specific training, it democratizes access to sophisticated 3D scene understanding. This isn't just about better occupancy maps; it's about enabling a new generation of AI applications that are more flexible, scalable, and capable of interacting intelligently with our complex 3D world. The future of spatial AI looks brighter, and it's built on the ability to see anywhere, anytime.
Curious to dive deeper? Check out the code on GitHub: [https://github.com/valeoai/OccAny](https://github.com/valeoai/OccAny)