
Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance

1Max Planck Institute for Informatics and Saarland University, 2Bielefeld University
TL;DR: We guide a latent diffusion model with multi-modal cues derived from the input hand-object image and several foundation models to reconstruct hand-object interactions in 3D.

Interactive Results

Drag to rotate, scroll/trackpad to zoom. Click any example on the left to load its mesh.

Abstract

We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.

Overview

Method overview
Overview of FollowMyHold. Given a single RGB frame, we (1) isolate the interaction region and derive binary hand/object masks with LangSAM and WiLoR's hand detector; (2) inpaint the occluded object appearance using FLUX.1 Kontext + Gemini (§4.1). Next, we obtain three complementary 3D cues: a HaMeR hand mesh, a MoGe-2 partial point cloud (with camera pose ϕ), and a coarse Hunyuan3D-2 HOI mesh. A two-step rigid ICP alignment registers these cues into a common, image-aligned frame. Finally, we perform inference-time guidance with a staged optimization (§4.2): Phase 1 optimizes the hand transform T_h; Phase 2 optimizes the object transform T_o and guides the velocity field; Phase 3 jointly refines (T_h, T_o) while guiding with pixel-aligned 2D losses (G_2D: normals, disparity, silhouette) and 3D constraints (G_3D: intersection, proximity). The bottom-right row shows progressive object refinement over diffusion steps.
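The staged optimization can be illustrated with a minimal sketch. Here, plain translations stand in for the full rigid transforms T_h and T_o, and simple quadratic losses stand in for the 2D guidance terms (keypoints, silhouette, normals, disparity) and the 3D proximity constraint; all targets, weights, and the contact margin are illustrative assumptions, not the paper's actual losses or values.

```python
import numpy as np

def step(x, grad, lr=0.1):
    """One gradient-descent update."""
    return x - lr * grad

def optimize_translation(x0, target, iters=100, lr=0.1):
    """Drive x toward a proxy target by minimizing ||x - target||^2."""
    x = x0.copy()
    for _ in range(iters):
        grad = 2.0 * (x - target)  # analytic gradient of the quadratic loss
        x = step(x, grad, lr)
    return x

# Phase 1: hand transform T_h, driven here by a proxy for the
# 2D keypoint-reprojection target (hypothetical values).
T_h = optimize_translation(np.zeros(3), np.array([0.10, -0.05, 0.40]))

# Phase 2: object transform T_o, driven by a proxy for the
# silhouette/normal/disparity alignment target (hypothetical values).
T_o = optimize_translation(np.zeros(3), np.array([0.12, -0.02, 0.45]))

# Phase 3: joint refinement with a proximity term that pulls hand and
# object toward contact (penalize distance beyond a contact margin).
margin = 0.05
for _ in range(100):
    diff = T_o - T_h
    dist = np.linalg.norm(diff)
    if dist > margin:            # only active outside the contact margin
        grad = diff / dist       # gradient of the distance w.r.t. T_o
        T_o = step(T_o, 0.5 * (dist - margin) * grad, lr=0.1)
        T_h = step(T_h, -0.5 * (dist - margin) * grad, lr=0.1)

print(np.linalg.norm(T_o - T_h) <= margin + 1e-6)  # contact satisfied
```

In the full method this loop runs inside the diffusion sampler, so each phase also shapes the velocity field rather than only the rigid transforms; the sketch above shows only the transform-optimization side of the design.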

Citation

@misc{aytekin2025followholdhandobjectinteraction,
    title={Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance}, 
    author={Ayce Idil Aytekin and Helge Rhodin and Rishabh Dabral and Christian Theobalt},
    year={2025},
    eprint={2508.18213},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2508.18213}, 
}