HOI-TG: End-to-End HOI Reconstruction Transformer with Graph-based Encoding

Zhenrong Wang1, Qi Zheng1, Sihan Ma2, Maosheng Ye3, Yibing Zhan4, Dongjiang Li4
1Shenzhen University, 2University of Sydney, 3DeepRoute.AI, 4JD Explore Academy

CVPR 2025 Highlight

HOI-TG is an end-to-end transformer framework for 3D human-object interaction (HOI) reconstruction from a single image. It uses self-attention to implicitly model contact between humans and objects. The model achieves state-of-the-art performance on the BEHAVE and InterCap datasets, improving human and object reconstruction accuracy on InterCap by 8.9% and 8.6%, respectively. This shows that global posture and fine-grained interaction can be modeled jointly without explicit contact constraints.

Several studies integrate interaction representations for joint human-object reconstruction:

  • StackFLOW models spatial relationships using human-object offsets from surface anchors.
  • CHORE predicts a part correspondence field to identify contact points.
  • CONTHO estimates vertex-level contact maps to mitigate erroneous correlations.

While contact constraints aid HOI reconstruction, they introduce a conflict: global positioning is key for mesh reconstruction, whereas interaction constraints emphasize local relationships. Balancing the two is challenging; for example, StackFLOW requires costly post-optimization to improve reconstruction quality. To overcome this, we propose a transformer-based framework that learns interaction-aware reconstruction implicitly, without explicit contact constraints.
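The idea can be illustrated with a minimal PyTorch sketch (hypothetical module and tensor names, not taken from the paper's released code): human and object vertex tokens are concatenated and passed through a shared transformer encoder, so attention between the two meshes emerges from ordinary self-attention rather than from an explicit contact term.

```python
import torch
import torch.nn as nn

class JointHOIEncoder(nn.Module):
    """Illustrative sketch: one self-attention encoder over human + object vertex tokens."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, human_tokens, object_tokens):
        # human_tokens: (B, N_h, dim), object_tokens: (B, N_o, dim)
        tokens = torch.cat([human_tokens, object_tokens], dim=1)
        out = self.encoder(tokens)          # self-attention mixes human and object tokens
        n_h = human_tokens.shape[1]
        return out[:, :n_h], out[:, n_h:]   # split back into human / object streams
```

Because both token sets share one attention space in this sketch, human-object interaction is learned implicitly; no contact loss or post-optimization step is involved.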

HOI-TG outperforms previous models in mesh reconstruction accuracy, contact reconstruction accuracy, and running speed. The results demonstrate that HOI-TG achieves better global mesh reconstruction and higher-quality contact areas. These improvements directly showcase the effectiveness of our straightforward transformer encoder and graph convolutional structures, which implicitly learn the interactions between humans and objects.

We use the average of the attention values from each human vertex to all object vertices as the attention of that human vertex towards the object.
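Concretely, this reduces to averaging rows of the attention matrix. A small sketch, assuming attention weights over concatenated [human | object] tokens and averaging over heads (the head treatment is our assumption):

```python
import torch

def human_to_object_attention(attn, n_human):
    """attn: (heads, N, N) attention weights over concatenated [human | object] tokens.
    Returns one scalar per human vertex: its mean attention toward all object tokens."""
    attn = attn.mean(dim=0)                # average over heads -> (N, N)
    human_rows = attn[:n_human, n_human:]  # human queries attending to object keys
    return human_rows.mean(dim=-1)         # (N_human,) average over object vertices
```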

For simple and static interactions, the position of the object relates only to local body parts. For instance, the positions of a chair, table, monitor, or basketball depend solely on the locations of the interacting body parts.

In contrast, for complex interactions, such as sitting on a chair while using a keyboard or moving with a suitcase, our model successfully attends to non-local body parts and leverages non-local vertices to predict the positions of objects.

Reconstruction quality results on BEHAVE and InterCap datasets

Abstract

With the diversification of human-object interaction (HOI) applications and the success of capturing human meshes, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, this approach leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on the BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
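A graph residual block of this kind can be sketched roughly as follows, assuming a plain GCN-style aggregation over a precomputed, normalized mesh adjacency matrix; the exact block design in the paper may differ. The graph convolution aggregates each vertex's neighbors and is added back to the input features, so local mesh topology is injected without discarding the transformer's global features.

```python
import torch
import torch.nn as nn

class GraphResidualBlock(nn.Module):
    """Illustrative graph residual block: x + GCN(x), with mesh adjacency A of shape (V, V)."""
    def __init__(self, dim, adjacency):
        super().__init__()
        # the normalized adjacency is stored as a buffer so it moves with the module's device
        self.register_buffer("adj", adjacency)
        self.norm = nn.LayerNorm(dim)
        self.linear = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # x: (B, V, dim) vertex features; aggregate neighbors via A @ x, then project
        neighbor = torch.matmul(self.adj, self.norm(x))
        return x + self.act(self.linear(neighbor))
```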

BibTeX

BibTex Code Here