# Foundation Vision Models - Machine Perception

**Foundation Vision Models** are large-scale AI models pre-trained on vast amounts of visual data. They serve as a general-purpose "visual cortex" for a robot, providing a fundamental understanding of the physical world from sensor inputs such as cameras. These models can be fine-tuned for specific downstream tasks with minimal additional training. They are often categorized into **Large Geospatial Models**, which understand environments, and **Foundation Geometric Models**, which excel at interpreting the 3D structure of scenes.

Key downstream tasks enabled by these models include:

* **Visual Relocalization:** Pinpointing a robot’s precise position and orientation, that is, its six-degree-of-freedom pose (x, y, z, θ, ϕ, ψ), within a known or even an unknown environment using only camera imagery. This is crucial for navigation when GPS is unavailable or inaccurate, such as indoors.
* **Depth Estimation (Metric Scaling):** Accurately calculating the physical distance to objects from 2D images. This allows a robot to understand the scale and dimensions of its surroundings, turning a flat image into a quantifiable 3D space.
* **3D Reconstruction:** Generating detailed and geometrically accurate 3D models of objects, rooms, or outdoor scenes from one or more images (monocular or binocular views). This creates a digital representation the robot can use for path planning and interaction.
* **Semantic 3D Visual Segmentation:** Identifying, classifying, and segmenting different objects and structures within a 3D reconstruction. The model doesn't just see a "lump" of points; it understands "this is a chair," "this is a table," and "this is the floor," assigning meaning to the geometry.
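
To make the step from depth estimation to 3D reconstruction concrete, the following is a minimal sketch of back-projecting a metric depth map into 3D points in the camera frame. It assumes a standard pinhole camera model; the function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth values are illustrative assumptions, not the output or API of any particular foundation model.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Turn a metric depth map (meters) into 3D points in the camera frame.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):       # v: pixel row index
        for u, z in enumerate(row):       # u: pixel column index
            if z <= 0.0:                  # skip pixels with no valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map; 0.0 marks a pixel with no depth estimate.
depth_map = [
    [2.0, 2.0],
    [0.0, 4.0],
]
points = backproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points)
```

In a real pipeline the depth map would come from a depth-estimation model and the intrinsics from camera calibration. The resulting points could then be transformed into a world frame using the pose recovered by visual relocalization, and labeled per point by a segmentation stage.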


