# Foundation Vision Models - Machine Perception

**Foundation Vision Models** are large-scale AI models pre-trained on vast amounts of visual data. They serve as a general-purpose "visual cortex" for a robot, providing a fundamental understanding of the physical world from sensor inputs such as cameras. Once pre-trained, these models can be fine-tuned for specific downstream tasks with minimal additional training. They are often categorized into **Large Geospatial Models** and **Foundation Geometric Models**, both of which model environments and excel at interpreting the 3D structure of scenes.
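
The "fine-tune with minimal additional training" workflow can be sketched as a linear probe: a frozen pre-trained backbone supplies features, and only a small task head is trained on the downstream data. Below is a minimal NumPy sketch in which a fixed random projection stands in for a real pre-trained backbone; the shapes, dataset, and labels are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a FROZEN pre-trained backbone: a fixed feature extractor.
# (In practice this would be a large pre-trained vision network.)
W_backbone = rng.standard_normal((64, 16))

def backbone(images_flat):
    """Map flattened 'images' (N, 64) to feature vectors (N, 16)."""
    return np.tanh(images_flat @ W_backbone)

# Small labelled dataset for the downstream task (toy binary labels).
X = rng.standard_normal((200, 64))
y = (X[:, 0] > 0).astype(float)

# Fine-tune ONLY a lightweight linear head on the frozen features,
# here via ridge-regularized least squares.
F = backbone(X)
head = np.linalg.solve(F.T @ F + 1e-3 * np.eye(16), F.T @ y)

preds = (backbone(X) @ head > 0.5).astype(float)
accuracy = (preds == y).mean()
print(f"training accuracy of linear probe: {accuracy:.2f}")
```

The backbone's weights never change; only the 16-parameter head is fit, which is why adaptation needs far less data and compute than training from scratch.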

Key downstream tasks enabled by these models include:

* **Visual Relocalization:** Pinpointing a robot’s precise position and orientation (x, y, z, θ, ϕ, ψ) within a known or even an unknown environment using only camera imagery. This is crucial for navigation when GPS is unavailable or inaccurate, such as indoors.
* **Depth Estimation (Metric Scaling):** Accurately calculating the physical distance to objects from 2D images. This allows a robot to understand the scale and dimensions of its surroundings, turning a flat image into a quantifiable 3D space.
* **3D Reconstruction:** Generating detailed and geometrically accurate 3D models of objects, rooms, or outdoor scenes from one or more images (monocular or binocular views). This creates a digital representation the robot can use for path planning and interaction.
* **Semantic 3D Visual Segmentation:** Identifying, classifying, and segmenting different objects and structures within a 3D reconstruction. The model doesn't just see a "lump" of points; it understands "this is a chair," "this is a table," and "this is the floor," assigning meaning to the geometry.
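
To make the six-DoF relocalization output (x, y, z, θ, ϕ, ψ) concrete, the sketch below builds a rotation matrix from the three angles, treated here as Z-Y-X Euler angles, and recovers the camera's world position from a world-to-camera pose. The Euler convention and the example pose values are assumptions for illustration; real relocalization systems differ in conventions.

```python
import numpy as np

def euler_zyx_to_R(theta, phi, psi):
    """Rotation matrix from Z-Y-X Euler angles (yaw, pitch, roll).
    Convention chosen for illustration; systems vary."""
    cz, sz = np.cos(theta), np.sin(theta)
    cy, sy = np.cos(phi), np.sin(phi)
    cx, sx = np.cos(psi), np.sin(psi)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx

# Hypothetical relocalization estimate: angles plus a translation that
# maps world coordinates into the camera frame (p_cam = R @ p_world + t).
theta, phi, psi = 0.3, -0.1, 0.05
R = euler_zyx_to_R(theta, phi, psi)   # world -> camera rotation
t = np.array([0.5, -1.0, 2.0])        # world -> camera translation

# The robot's position (x, y, z) in the world frame:
C = -R.T @ t
# Sanity check: the camera center maps to the camera-frame origin.
assert np.allclose(R @ C + t, 0)
print("camera position in world frame:", C)
```

The point of the algebra is that a pose estimate is a rigid transform: once (R, t) is known, the robot's world position falls out as C = -Rᵀt.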

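As a concrete link between the last three tasks, the sketch below back-projects a small metric depth map through pinhole camera intrinsics into a 3D point cloud, then attaches per-pixel semantic labels to the points so the geometry can be queried by class. The intrinsics, depth values, and label map are all toy values invented for illustration.

```python
import numpy as np

# Assumed pinhole intrinsics: focal lengths and principal point, in pixels.
fx = fy = 100.0
px = py = 2.0
H, W = 4, 4

# Toy metric depth map (metres), e.g. output of a depth-estimation model.
depth = np.full((H, W), 2.0)
depth[2:, :] = 1.0            # lower half of the image is closer

# Toy per-pixel semantic labels: 0 = floor, 1 = table.
labels = np.zeros((H, W), dtype=int)
labels[2:, :] = 1

# Back-project every pixel (u, v) with metric depth z:
#   X = (u - px) * z / fx,  Y = (v - py) * z / fy,  Z = z
v, u = np.mgrid[0:H, 0:W]
X = (u - px) * depth / fx
Y = (v - py) * depth / fy
points = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)   # (H*W, 3)
point_labels = labels.reshape(-1)

# A semantic query on the reconstruction: where are the "table" points?
table = points[point_labels == 1]
print("table points:", len(table), "mean distance:", table[:, 2].mean())
```

This is the pipeline in miniature: metric depth turns pixels into a quantifiable 3D space, back-projection yields a reconstruction, and the semantic labels let a planner reason about "the table" rather than an unlabelled lump of points.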