MDA: Modeling Depth Ambiguity for Flying-Point-Free Depth Estimation

Drag the divider to compare DA3 (left) against Ours (right). Click any thumbnail to switch the scene. Our model predicts clean depth boundaries without flying points.

TL;DR: MDA introduces a mixture-density depth representation that resolves depth ambiguity at blurry object boundaries, largely eliminates flying points, adds negligible overhead, and applies across different backbones. It also extends naturally to other forms of depth ambiguity, including transparent objects and sky.

What Are Flying Points?

Flying points are 3D points stranded between a foreground object and the background. They appear along object edges and corrupt the 3D reconstruction; in the comparison below, the baselines are full of them while MDA stays clean.

Bigger backbones (DA3, VGGT) don't remove them, and neither does PPD, a slow denoising-style refinement on top of an existing predictor. Swap the middle column to see that all three still scatter flying points between surfaces.

Why Flying Points Happen

These artifacts come from a standard assumption: one depth per pixel. On flat regions this is fine. Near object edges, a pixel mixes evidence from both foreground and background, and the model can't tell which surface the pixel sits on, so its depth is ambiguous.

A one-depth model cannot represent this ambiguity. Its unimodal loss always demands a single answer, so training pulls the prediction to a value between the foreground and background depths. In 3D, that value lies on neither surface and becomes a flying point.
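The pull toward an in-between value can be checked numerically. The sketch below is illustrative only: it assumes a squared-error loss and a boundary pixel whose training target is the foreground depth 60% of the time and the background depth 40% of the time. The depths, the proportions, and the helper name `best_single_depth` are assumptions, not values from the paper.

```python
import numpy as np

def best_single_depth(targets, num_candidates=2001):
    """Grid-search the one depth value that minimizes the mean squared error."""
    targets = np.asarray(targets, dtype=np.float64)
    candidates = np.linspace(targets.min(), targets.max(), num_candidates)
    # cost[i] = mean over targets of (candidate_i - target)^2
    cost = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)
    return float(candidates[np.argmin(cost)])

d_fg, d_bg = 2.0, 10.0                        # assumed foreground / background depths
targets = np.array([d_fg] * 6 + [d_bg] * 4)   # assumed 60/40 mix at a boundary pixel
d_l2 = best_single_depth(targets)

print(f"foreground={d_fg}, background={d_bg}, best single depth={d_l2:.2f}")
```

Under these assumptions the minimizer is the mean of the targets, which sits strictly between the two surfaces and therefore on neither of them.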

Input frame with depth edges overlaid in red, alongside the 3D point cloud of a baseline (DA3-GIANT) prediction. Click any red depth edge in the input, and the corresponding pixels light up in orange in the 3D viewer, making the **flying points** more obvious. The two zoom panels show the same pixel window in RGB and predicted depth.

Our Fix: A Mixture-Density Representation

MDA replaces the single-value depth output with a mixture density: each pixel predicts a few depth hypotheses, each with a probability. On flat regions, the hypotheses agree — just like a standard depth model. At an edge, they separate: one matches the foreground, another the background.

To produce a final depth, MDA picks one hypothesis rather than averaging them. Even with imperfect probabilities, the output is always one of the predictions, so it lands on a real surface and is flying-point-free. Only the final layer changes, so the same head plugs into DA3 or VGGT with negligible overhead.
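A minimal sketch of the difference between selecting and averaging follows. It assumes the network emits `K` depth hypotheses and `K` probabilities per pixel, and that the selected hypothesis is the most probable one; the page only states that one hypothesis is picked, so the argmax rule, the array layout, and the function names are assumptions.

```python
import numpy as np

def decode_select(depths, weights):
    """Return, per pixel, the hypothesis with the highest weight. Inputs: (H, W, K)."""
    idx = np.argmax(weights, axis=-1)
    return np.take_along_axis(depths, idx[..., None], axis=-1)[..., 0]

def decode_average(depths, weights):
    """Return the weight-averaged depth, shown only for contrast."""
    return (depths * weights).sum(axis=-1)

# One row of two pixels with K = 2 hypotheses each.
# Pixel 0 is flat (hypotheses agree); pixel 1 is on an edge (hypotheses separate).
depths = np.array([[[2.0, 2.0], [2.0, 10.0]]])
weights = np.array([[[0.5, 0.5], [0.55, 0.45]]])

selected = decode_select(depths, weights)
averaged = decode_average(depths, weights)
print("selected:", selected)
print("averaged:", averaged)
```

On the flat pixel both decoders agree. On the edge pixel the average falls between the two surfaces, while the selected value is one of the predicted hypotheses even though the weights are close to even.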

Unimodal vs. mixture-density depth at an object boundary. **(a)** A boundary pixel covers both a foreground surface at $d_1$ and a background surface at $d_2$. **(b)** A unimodal predictor outputs a single value $d_3$ between $d_1$ and $d_2$, on neither surface, resulting in a flying point. **(c)** MDA's mixture has one hypothesis per surface; decoding picks one.

Look Inside a Pixel

In MDA, edge pixels carry a multi-peaked distribution. Click near an edge to see the predicted mixture density: one peak per candidate surface. MDA selects one of these peaks, so its final output never falls between surfaces.
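To make the multi-peaked shape concrete, the sketch below evaluates a per-pixel mixture on a depth grid and counts its peaks. The page does not specify the component family, so Gaussian components are an assumption here, as are all numeric values and the helper names `mixture_pdf` and `count_peaks`.

```python
import numpy as np

def mixture_pdf(d, means, scales, weights):
    """Density of a 1D Gaussian mixture evaluated at each depth in d."""
    d = np.asarray(d, dtype=np.float64)[:, None]
    comp = np.exp(-0.5 * ((d - means) / scales) ** 2) / (scales * np.sqrt(2.0 * np.pi))
    return (comp * weights).sum(axis=1)

def count_peaks(p):
    """Count strict local maxima of a sampled curve."""
    mid = p[1:-1]
    return int(np.sum((mid > p[:-2]) & (mid > p[2:])))

grid = np.linspace(0.0, 12.0, 1201)  # depth axis, step 0.01

# Edge pixel: hypotheses separate onto foreground (2.0) and background (10.0).
edge = mixture_pdf(grid, np.array([2.0, 10.0]), np.array([0.2, 0.3]), np.array([0.6, 0.4]))
# Flat pixel: hypotheses nearly agree, so the mixture has a single peak.
flat = mixture_pdf(grid, np.array([5.0, 5.02]), np.array([0.2, 0.2]), np.array([0.5, 0.5]))

print("edge peaks:", count_peaks(edge))
print("flat peaks:", count_peaks(flat))
```

The edge pixel yields one peak per candidate surface, while the flat pixel collapses to a single peak, matching the behaviour of a standard depth model there.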

Panels: input with depth-boundary pixels tinted red; depth distribution at the selected pixel; 3D point cloud.


3D point cloud of MDA's predicted depth. The clicked boundary pixel is marked with a red dot, and its surrounding image window lights up in orange.

Robustness to Input Blur

Blur increases depth ambiguity at object edges: more pixels contain evidence from both foreground and background surfaces. As blur grows, baseline depth quality drops sharply, while MDA remains stable and nearly flying-point-free. Filtering the baselines' lowest-confidence pixels, the standard fix, does not remove all flying points either.

Blur comparison, from clean to blurry (shown at s = 8). Panels: Input, DA3 vs Ours, VGGT vs Ours.

Chamfer distance (CD, lower is better) vs. input blur level.

Pred-to-GT distance (Acc, lower is better) vs. input blur level.

Across all blur levels, the baselines remain consistently worse than MDA, and their performance drops more sharply as input blur grows.
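For readers unfamiliar with the two metrics, here is a brute-force sketch. Definitions vary between papers, so it assumes Acc is the mean distance from each predicted point to its nearest ground-truth point, and CD is the average of Acc and the reverse (GT-to-pred) distance; the exact variant behind the plots is not stated on this page. The toy point clouds are invented for illustration.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (N, 3), the distance to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def accuracy(pred, gt):
    """Mean pred-to-GT nearest-neighbour distance (assumed definition of Acc)."""
    return float(nearest_distances(pred, gt).mean())

def chamfer(pred, gt):
    """Symmetric Chamfer distance as the average of both directions (assumed variant)."""
    return 0.5 * (accuracy(pred, gt) + accuracy(gt, pred))

# Toy scene: two points on a foreground surface (z = 2), two on the background (z = 10).
gt = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0], [0.0, 0.0, 10.0], [1.0, 0.0, 10.0]])
clean = gt.copy()
# Same prediction plus one flying point stranded between the surfaces.
flying = np.vstack([gt, [[0.5, 0.0, 6.0]]])

print("clean  Acc / CD:", accuracy(clean, gt), chamfer(clean, gt))
print("flying Acc / CD:", accuracy(flying, gt), chamfer(flying, gt))
```

A single flying point raises Acc because it is far from every ground-truth surface, which is why this direction of the metric is sensitive to the artifact.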

Extension Beyond Boundaries

The same mixture density handles other forms of depth ambiguity. For glass, a single camera ray can reach multiple surfaces, so one pixel may have multiple valid depths; the mixture preserves them instead of collapsing them. For sky, MDA adds a dedicated depth component pinned at an effectively infinite depth, letting sky pixels separate cleanly from finite-depth surfaces.

Sky: A Dedicated Component, Threshold-Free Skylines

A pixel is labelled sky when its dedicated sky component has the highest weight. The sky pixels are pinned to the component's fixed, effectively infinite depth, generating no flying points.
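The labelling rule above can be sketched as follows. The layout (a separate sky weight per pixel next to `K` finite hypotheses), the constant `SKY_DEPTH`, and the function name are assumptions made for illustration; the page only states that the sky component has a fixed, effectively infinite depth.

```python
import numpy as np

SKY_DEPTH = 1e4  # assumed stand-in for "effectively infinite" depth

def decode_with_sky(depths, weights, sky_weight):
    """Label sky where the sky weight is the largest; otherwise pick the best finite hypothesis.

    depths, weights: (H, W, K) finite-depth hypotheses and their weights.
    sky_weight:      (H, W) weight of the dedicated sky component.
    """
    all_weights = np.concatenate([weights, sky_weight[..., None]], axis=-1)
    sky_mask = np.argmax(all_weights, axis=-1) == weights.shape[-1]
    best_finite = np.argmax(weights, axis=-1)
    finite_depth = np.take_along_axis(depths, best_finite[..., None], axis=-1)[..., 0]
    return np.where(sky_mask, SKY_DEPTH, finite_depth), sky_mask

# Two pixels: a building pixel and a sky pixel.
depths = np.array([[[3.0, 8.0], [50.0, 60.0]]])
weights = np.array([[[0.7, 0.2], [0.1, 0.1]]])
sky_weight = np.array([[0.1, 0.8]])

depth, sky_mask = decode_with_sky(depths, weights, sky_weight)
print(depth, sky_mask)
```

Because the label comes from comparing component weights, no depth threshold is needed to find the skyline in this sketch.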

The **baseline** unprojects sky pixels at their predicted finite depth, scattering them as a noisy near-field shell. MDA's dedicated **sky component** pushes those pixels to a far plane, so the visible geometry remains intact and the skyline stays flying-point-free. Drag either point-cloud panel in a row to rotate both.

Transparent Objects: Layered Depth From One Image

Through glass, a pixel can have more than one real depth: the glass surface and what lies behind it. MDA keeps both hypotheses active.

First layer: the glass. Last layer: the surface behind it. Mask: pixels where both layers remain active.
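One way to read out the layers is sketched below. It assumes a hypothesis counts as active when its weight reaches a threshold, takes the nearest active hypothesis as the first layer and the farthest as the last, and marks a pixel when the two differ by more than a small gap. The thresholds `min_weight` and `min_gap` and the function name are hypothetical and not taken from the paper.

```python
import numpy as np

def layered_depth(depths, weights, min_weight=0.2, min_gap=0.1):
    """Return (first_layer, last_layer, mask) from per-pixel hypotheses of shape (H, W, K).

    Assumes at least one hypothesis per pixel reaches min_weight, which holds
    when the weights sum to 1 and min_weight <= 1 / K.
    """
    active = weights >= min_weight
    first = np.where(active, depths, np.inf).min(axis=-1)   # nearest active surface
    last = np.where(active, depths, -np.inf).max(axis=-1)   # farthest active surface
    mask = (last - first) > min_gap                         # two distinct layers remain
    return first, last, mask

# Pixel 0: opaque wall, one dominant hypothesis. Pixel 1: glass in front of a wall.
depths = np.array([[[4.0, 4.0], [1.5, 6.0]]])
weights = np.array([[[0.9, 0.1], [0.45, 0.55]]])

first, last, mask = layered_depth(depths, weights)
print(first, last, mask)
```

On the opaque pixel both layers coincide and the mask stays off; on the glass pixel the first layer is the glass and the last layer is the surface behind it.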

Panels: Input, first layer (front), last layer (behind), segmentation.

BibTeX

@misc{bian2026modeling,
  title         = {Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation},
  author        = {Siyuan Bian and Congrong Xu and Jun Gao},
  year          = {2026},
  eprint        = {2606.02552},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.02552}
}