TL;DR: MDA introduces a mixture-density depth representation that resolves depth ambiguity at blurry object boundaries, largely eliminates flying points, adds negligible overhead, and applies across different backbones. It also extends naturally to other forms of depth ambiguity, including transparent objects and sky.
Flying points, the stray 3D points that float between surfaces, come from a standard assumption: one depth per pixel. On flat regions this is fine. Near object edges, a pixel mixes evidence from both foreground and background, and the model can't tell which surface it sits on, so its depth is ambiguous.
A one-depth model cannot represent this ambiguity. Its unimodal loss always demands a single answer, so training pulls the prediction to a value between the foreground and background depths. In 3D, that value lies on neither surface and becomes a flying point.
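A toy calculation makes the pull toward an in-between value concrete. The sketch below assumes squared error as the unimodal loss (the text above does not name a specific loss) and uses made-up depths for one edge pixel whose targets are sometimes foreground and sometimes background.

```python
# Toy illustration with made-up numbers: one edge pixel whose training
# targets are sometimes the foreground depth, sometimes the background depth.
FOREGROUND_M = 2.0
BACKGROUND_M = 10.0
targets = [FOREGROUND_M] * 6 + [BACKGROUND_M] * 4  # 60% foreground, 40% background


def mse(prediction, targets):
    """Mean squared error of one scalar prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Scan candidate depths on a 0.1 m grid and keep the one with the lowest loss.
candidates = [i / 10 for i in range(0, 121)]
best = min(candidates, key=lambda d: mse(d, targets))

# The minimizer is the mean of the targets, which lies on neither surface.
print(best, FOREGROUND_M < best < BACKGROUND_M)
```

Under this assumption the best single answer is the weighted mean of the two surfaces, which is exactly where a flying point appears in 3D.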
MDA replaces the single-value depth output with a mixture density: each pixel predicts a few depth hypotheses, each with a probability. On flat regions, the hypotheses agree, just as in a standard depth model. At an edge, they separate: one matches the foreground, another the background.
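As a rough sketch of what a per-pixel mixture density looks like, the code below evaluates the negative log-likelihood of a depth under a small mixture. The Laplace component family, the softmax over logits, and all numeric values are assumptions made for illustration; the description above only says that each pixel predicts a few hypotheses with probabilities.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into mixture weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_nll(depth, logits, means, scales):
    """Negative log-likelihood of `depth` under a 1D Laplace mixture (assumed form)."""
    weights = softmax(logits)
    density = sum(
        w * math.exp(-abs(depth - mu) / b) / (2.0 * b)
        for w, mu, b in zip(weights, means, scales)
    )
    return -math.log(density)


# Hypothetical edge pixel: two hypotheses, one per surface.
edge = dict(logits=[0.3, -0.3], means=[2.0, 10.0], scales=[0.2, 0.2])

nll_fg = mixture_nll(2.0, **edge)   # target on the foreground surface
nll_bg = mixture_nll(10.0, **edge)  # target on the background surface
nll_mid = mixture_nll(6.0, **edge)  # target halfway between the surfaces
print(nll_fg, nll_bg, nll_mid)
```

Both surfaces are cheap to explain under the mixture while the midpoint is expensive, so a likelihood of this kind would not reward an in-between prediction.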
To produce a final depth, MDA picks one hypothesis rather than averaging them. Even with imperfect probabilities, the output is always one of the predictions, so it lands on a real surface and is flying-point-free. Only the final layer changes, so the same head plugs into DA3 or VGGT with negligible overhead.
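The difference between picking and averaging is easy to see on a single pixel. This is a minimal sketch that assumes the selection rule is "take the highest-probability hypothesis"; the helper names and the numbers are hypothetical.

```python
def select_depth(weights, depths):
    """Return the depth of the highest-weight hypothesis (no averaging)."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return depths[best]


def mean_depth(weights, depths):
    """Probability-weighted average of the hypotheses, for comparison."""
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, depths)) / total


# Hypothetical edge pixel with nearly tied probabilities.
weights = [0.55, 0.45]
depths = [2.0, 10.0]

selected = select_depth(weights, depths)  # always one of the hypotheses
averaged = mean_depth(weights, depths)    # falls between the two surfaces
print(selected, averaged)
```

Even if the probabilities were wrong and the second hypothesis were chosen, the output would still be a predicted surface depth rather than a value between the two.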
In MDA, edge pixels carry a multi-peaked distribution: the predicted mixture density has one peak per candidate surface. MDA selects one of these peaks, so its final output never falls between surfaces.
Blur increases depth ambiguity at object edges: more pixels contain evidence from both foreground and background surfaces. As blur grows, baseline depth quality drops sharply, while MDA remains stable and nearly flying-point-free. Filtering the baselines' lowest-confidence pixels, the standard fix, does not remove all flying points either.
Across all blur levels, the baselines remain consistently worse than MDA, and their performance drops more sharply as input blur grows.
The same mixture density handles other forms of depth ambiguity. For glass, a single camera ray can reach multiple surfaces, so one pixel may have multiple valid depths; the mixture preserves them instead of collapsing them. For sky, MDA adds a dedicated depth component pinned at an effectively infinite depth, letting sky pixels separate cleanly from finite-depth surfaces.
A pixel is labelled sky when its dedicated sky component has the highest weight. Sky pixels are assigned that component's fixed, effectively infinite depth, so they generate no flying points.
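The sky rule can be sketched in a few lines. The slot index `SKY_INDEX`, the function name, and the use of `math.inf` as a stand-in for "effectively infinite" are assumptions for illustration only.

```python
import math

SKY_INDEX = 0  # hypothetical: which mixture slot is the dedicated sky component


def resolve_pixel(weights, depths):
    """Return (is_sky, depth) for one pixel.

    `depths[SKY_INDEX]` is ignored: the sky component's depth is fixed, not predicted.
    """
    best = max(range(len(weights)), key=lambda k: weights[k])
    if best == SKY_INDEX:
        return True, math.inf  # stand-in for an effectively infinite depth
    return False, depths[best]


# Hypothetical pixels: (sky weight, hypothesis 1 weight, hypothesis 2 weight).
print(resolve_pixel([0.8, 0.1, 0.1], [None, 40.0, 55.0]))  # sky wins
print(resolve_pixel([0.1, 0.7, 0.2], [None, 40.0, 55.0]))  # finite surface wins
```

Because a sky pixel never takes a finite depth, it cannot land between a building edge and the background.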
Through glass, a pixel can have more than one real depth: the glass surface and what lies behind it. MDA keeps both hypotheses active.
First layer: the glass. Last layer: the surface behind it. Mask: pixels where both layers remain active.
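One way to read off the two layers and the mask from a pixel's hypotheses is sketched below. What counts as "active" is not specified above, so the weight cutoff `threshold` and the minimum separation `min_gap` are hypothetical parameters chosen for illustration.

```python
def glass_layers(weights, depths, threshold=0.2, min_gap=0.05):
    """Return (first_layer, last_layer, both_active) for one pixel.

    A hypothesis counts as active when its weight is at least `threshold`.
    The mask is set only when the active hypotheses are more than `min_gap`
    apart, so agreeing hypotheses on an opaque surface do not trigger it.
    """
    active = sorted(d for w, d in zip(weights, depths) if w >= threshold)
    if not active:
        # Fall back to the single most probable hypothesis.
        best = max(range(len(weights)), key=lambda k: weights[k])
        active = [depths[best]]
    first, last = active[0], active[-1]
    return first, last, (last - first) > min_gap


# Hypothetical pixel seen through glass: glass at 1.5, wall behind it at 4.0.
print(glass_layers([0.5, 0.45, 0.05], [1.5, 4.0, 9.0]))
# Hypothetical opaque pixel: one dominant hypothesis.
print(glass_layers([0.9, 0.06, 0.04], [3.0, 3.1, 7.0]))
```

The nearest active hypothesis plays the role of the first layer and the farthest the last layer; on opaque pixels the two coincide and the mask stays off.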
@article{bian2026mda,
title = {Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation},
author = {Bian, Siyuan and Xu, Congrong and Gao, Jun},
journal = {arXiv preprint},
year = {2026}
}