UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation

1Noah's Ark Lab, Huawei Technologies, 2University of Toronto
CoRL 2025

Abstract

Estimating the 6D pose of novel objects is a fundamental yet challenging problem in robotics, often relying on access to object CAD models. However, acquiring such models can be costly and impractical. Recent approaches aim to bypass this requirement by leveraging strong priors from foundation models to reconstruct objects from single or multi-view images, but typically require additional training or produce hallucinated geometry. To address these limitations, we propose UnPose, a novel framework for zero-shot, model-free 6D object pose estimation and reconstruction that exploits 3D priors and uncertainty estimates from a pre-trained diffusion model. Specifically, starting from a single-view RGB-D frame, UnPose uses a multi-view diffusion model to estimate an initial 3D model in a 3D Gaussian Splatting (3DGS) representation, along with pixel-wise epistemic uncertainty estimates. As additional observations become available, we incrementally refine the 3DGS model by fusing new views guided by the diffusion model's uncertainty, thereby continuously improving pose estimation accuracy and 3D reconstruction quality. To ensure global consistency, the views generated by the diffusion prior and subsequent observations are further integrated into a pose graph and jointly optimized into a coherent 3DGS field. Extensive experiments demonstrate that UnPose significantly outperforms existing approaches in both 6D pose estimation accuracy and 3D reconstruction quality. We further showcase its practical applicability in real-world robotic manipulation tasks.

Method Overview


Starting from a single-view RGB-D frame, UnPose uses a multi-view diffusion model to estimate an initial 3D model in a 3D Gaussian Splatting (3DGS) representation, along with pixel-wise epistemic uncertainty estimates. As additional observations become available, we incrementally refine the 3DGS model by fusing new views guided by the diffusion model's uncertainty, thereby continuously improving pose estimation accuracy and 3D reconstruction quality. A schematic overview of the initialization module is shown above.
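The uncertainty-guided fusion above can be illustrated with a minimal sketch (not from the released code; the function name and the inverse-variance weighting scheme are illustrative assumptions): pixels where the diffusion prior is uncertain defer to the incoming real observation, and fused precision accumulates.

```python
import numpy as np

def fuse_views(model_rgb, model_unc, obs_rgb, obs_unc=None):
    """Fuse a new observation into the current model rendering,
    weighting each pixel by inverse (epistemic) uncertainty.
    model_rgb/obs_rgb: HxWx3 arrays; model_unc/obs_unc: HxW arrays."""
    if obs_unc is None:
        # Real observations are treated as near-certain.
        obs_unc = np.full(model_unc.shape, 1e-3)
    w_model = 1.0 / (model_unc + 1e-6)
    w_obs = 1.0 / (obs_unc + 1e-6)
    fused = (w_model[..., None] * model_rgb + w_obs[..., None] * obs_rgb) / (
        (w_model + w_obs)[..., None]
    )
    # Precisions add under fusion, so fused uncertainty shrinks.
    fused_unc = 1.0 / (w_model + w_obs)
    return fused, fused_unc
```

With this weighting, a highly uncertain diffusion-generated pixel is effectively overwritten by the real measurement, while confident prior pixels change little.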

Initialization from First Frame


Visualization of the uncertainty estimates of the images produced by the diffusion model during the initialization phase.

UnPose synthesizes multi-view diffusion images with uncertainty estimates from the first frame. The initial reconstruction, represented as 3DGS, is refined to metric scale using real-world observations from the first frame, thereby initializing a metrically consistent pose graph that still preserves the uncertainty estimates.
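One simple way to recover metric scale from the first RGB-D frame is to compare rendered depth from the (scale-ambiguous) diffusion-initialized 3DGS against the observed metric depth. The sketch below is an assumption of ours, not the paper's exact procedure; a robust median of per-pixel depth ratios is a common choice:

```python
import numpy as np

def estimate_metric_scale(rendered_depth, observed_depth, mask):
    """Estimate a global scale factor that maps the rendered
    (up-to-scale) depth onto the observed metric depth.
    mask: boolean HxW array of valid, overlapping pixels."""
    ratios = observed_depth[mask] / rendered_depth[mask]
    # Median is robust to outliers from occlusion or sensor noise.
    return np.median(ratios)
```

The estimated scale can then be applied to the 3DGS means and covariances before the pose graph is initialized.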

Incorporating Future Frames

Pose estimation is performed on the initial optimized 3DGS. The initial 6-DoF pose is incrementally refined by incorporating subsequent frames into the pose graph.
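The incremental pose-graph bookkeeping can be sketched as follows (a minimal illustration under our own assumptions, not the paper's implementation): nodes hold absolute 4x4 object poses, edges store relative-pose constraints between consecutive frames, and a residual measures how consistent the current node estimates are with the stored constraints.

```python
import numpy as np

def make_se3(R=np.eye(3), t=np.zeros(3)):
    """Build a 4x4 rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

class PoseGraph:
    """Minimal pose graph: nodes are absolute 4x4 poses,
    edges store relative-pose constraints between frames."""
    def __init__(self, initial_pose):
        self.nodes = [initial_pose]
        self.edges = []  # (i, j, T_ij) with node_j ≈ node_i @ T_ij

    def add_frame(self, rel_pose):
        i = len(self.nodes) - 1
        self.nodes.append(self.nodes[-1] @ rel_pose)
        self.edges.append((i, i + 1, rel_pose))

    def residual(self):
        # Sum of translational consistency errors over all edges;
        # a real system would jointly optimize nodes to minimize this.
        err = 0.0
        for i, j, T_ij in self.edges:
            pred = np.linalg.inv(self.nodes[i]) @ self.nodes[j]
            err += np.linalg.norm(pred[:3, 3] - T_ij[:3, 3])
        return err
```

In practice each new frame's relative pose would come from registering the observation against the current 3DGS model, and the joint optimization over all nodes enforces the global consistency described above.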

Results

Real Robot Experiment

Demonstration of UnPose in a real-world robotic manipulation task.

BibTeX

If you found our work useful in your research, please consider citing:

@inproceedings{jiang2025unpose,
  title={UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation},
  author={Jiang, Zhaodong and Sinha, Ashish and Cao, Tongtong and Ren, Yuan and Liu, Bingbing and Xu, Binbin},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2025}
}