👁️ Foundation Vision Models - Machine Perception

Foundation Vision Models are large-scale AI models pre-trained on vast amounts of visual data. They serve as a general-purpose "visual cortex" for a robot, providing a fundamental understanding of the physical world from sensor inputs like cameras. These models can be fine-tuned for specific downstream tasks with minimal additional training. They are often categorized into Large Geospatial Models, which understand large-scale environments, and Foundation Geometric Models, which excel at interpreting the 3D structure of scenes.

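As a minimal sketch of this fine-tuning pattern, the example below freezes a pre-trained backbone and trains only a small task head. The ResNet-50 backbone is an illustrative stand-in (real foundation models are typically much larger encoders), and the 10-class head is hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for a foundation vision model: a pre-trained torchvision
# backbone (illustrative; the fine-tuning pattern is the same for
# larger ViT-style foundation encoders).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the general-purpose "visual cortex".
for param in backbone.parameters():
    param.requires_grad = False

# Attach a small task-specific head; 10 classes is a hypothetical
# number for some downstream robot task.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Only the new head is trained -- "minimal additional training".
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on the head; the frozen backbone only
    supplies features."""
    loss = loss_fn(backbone(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the head is optimized, adaptation requires far less data and compute than training from scratch.
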
Key downstream tasks enabled by these models include the following; a short illustrative code sketch for each task appears after the list:

  • Visual Relocalization: Pinpointing a robot’s precise position and orientation (x, y, z, θ, ϕ, ψ) within a known or even an unknown environment using only camera imagery. This is crucial for navigation when GPS is unavailable or inaccurate, such as indoors.

  • Depth Estimation (Metric Scaling): Accurately calculating the physical distance to objects from 2D images. This allows a robot to understand the scale and dimensions of its surroundings, turning a flat image into a quantifiable 3D space.

  • 3D Reconstruction: Generating detailed and geometrically accurate 3D models of objects, rooms, or outdoor scenes from one or more images (monocular or binocular views). This creates a digital representation the robot can use for path planning and interaction.

  • Semantic 3D Visual Segmentation: Identifying, classifying, and segmenting different objects and structures within a 3D reconstruction. The model doesn't just see a "lump" of points; it understands "this is a chair," "this is a table," and "this is the floor," assigning meaning to the geometry.

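For visual relocalization, here is a minimal geometric sketch that assumes 2D-3D correspondences between image keypoints and map points are already available (in practice produced by a learned feature-matching model). It uses OpenCV's RANSAC PnP solver and SciPy to recover the six pose parameters:

```python
import cv2
import numpy as np
from scipy.spatial.transform import Rotation

def relocalize(pts3d: np.ndarray, pts2d: np.ndarray, K: np.ndarray):
    """Estimate the camera pose (x, y, z, θ, ϕ, ψ) from N matches.

    pts3d: (N, 3) map points in world coordinates.
    pts2d: (N, 2) corresponding pixel detections in the current frame.
    K:     3x3 camera intrinsics (lens distortion assumed already removed).
    """
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None
    )
    if not ok:
        raise RuntimeError("relocalization failed: too few consistent matches")
    R, _ = cv2.Rodrigues(rvec)              # world -> camera rotation
    position = (-R.T @ tvec).ravel()        # camera center in world frame
    # Orientation as roll/pitch/yaw Euler angles in degrees.
    angles = Rotation.from_matrix(R.T).as_euler("xyz", degrees=True)
    return position, angles
```
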
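For depth estimation, a sketch using the Hugging Face transformers depth-estimation pipeline; the DPT checkpoint named here is an illustrative choice, and whether its output is metric or only relative depth depends on the model used:

```python
from PIL import Image
from transformers import pipeline

# "depth-estimation" is a standard transformers pipeline task; the
# checkpoint is an illustrative choice, not an endorsement.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

frame = Image.open("robot_camera_frame.jpg")   # hypothetical input image
result = depth_estimator(frame)

# "predicted_depth" is the raw per-pixel tensor; "depth" is a PIL
# visualization of the same map.
depth = result["predicted_depth"]
print(depth.shape)
```
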
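The simplest single-view 3D reconstruction step is lifting a metric depth map into a point cloud with the pinhole camera model; fx, fy, cx, cy below are the camera's calibration intrinsics:

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Back-project an H x W metric depth map into an (M, 3) point cloud.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels with no valid depth
```
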
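One common recipe for semantic 3D segmentation is to run a 2D segmentation model per frame and project its class labels onto the reconstructed points. The sketch below assumes points expressed in the camera frame and a per-pixel class-id map from such a model:

```python
import numpy as np

def label_points(points: np.ndarray, seg_map: np.ndarray,
                 fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Give each 3D point the class of the pixel it projects onto.

    points:  (N, 3) points in the camera frame.
    seg_map: (H, W) integer class ids from a 2D segmentation model,
             e.g. ids meaning "chair", "table", "floor".
    Returns (N,) class ids; -1 marks points that fall outside the image.
    """
    h, w = seg_map.shape
    labels = np.full(len(points), -1, dtype=int)
    front = points[:, 2] > 0                # only points in front of the camera
    u = np.round(fx * points[front, 0] / points[front, 2] + cx).astype(int)
    v = np.round(fy * points[front, 1] / points[front, 2] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(front)[inside]
    labels[idx] = seg_map[v[inside], u[inside]]
    return labels
```
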