HOI-TG: End-to-End HOI Reconstruction Transformer with Graph-based Encoding

Zhenrong Wang¹, Qi Zheng¹, Sihan Ma², Maosheng Ye³, Yibing Zhan⁴, Dongjiang Li⁴

¹Shenzhen University, ²University of Sydney, ³DeepRoute.AI, ⁴JD Explore Academy

CVPR 2025 Highlight


HOI-TG is an end-to-end transformer framework for 3D human-object interaction (HOI) reconstruction from a single image. It uses self-attention to model contact between humans and objects implicitly. The model achieves state-of-the-art performance on the BEHAVE and InterCap datasets, improving human and object reconstruction accuracy on InterCap by 8.9% and 8.6%, respectively. These results indicate that global posture and fine-grained interaction can be modeled jointly without explicit contact constraints.

Several studies integrate interaction representations for joint human-object reconstruction:

(i) StackFLOW models spatial relationships using human-object offsets from surface anchors.

(ii) CHORE predicts a part correspondence field to identify contact points.

(iii) CONTHO estimates vertex-level contact maps to mitigate erroneous correlations.

While contact constraints aid HOI reconstruction, they introduce a conflict: global positioning is key for mesh reconstruction, whereas interaction constraints emphasize local relationships. Balancing the two is challenging; for example, StackFLOW requires costly post-optimization to improve quality. To overcome this, we propose a transformer-based framework that learns interaction-aware reconstruction implicitly, without explicit constraints.

HOI-TG outperforms previous models in mesh reconstruction accuracy, contact reconstruction accuracy, and running speed. The results show that HOI-TG achieves better global mesh reconstruction and higher-quality contact areas. These improvements reflect the effectiveness of our straightforward transformer encoder and graph convolutional structures, which implicitly learn the interactions between humans and objects.

We use the average of attention values from each human vertex to all object vertices as the attention value of that human vertex towards the object.
For simple and static interactions, the positions of objects only relate to local body parts. For instance, the positions of a chair, table, monitor, and basketball solely depend on the locations of the interacting body parts.
In contrast, for complex interactions, such as sitting on a chair while using a keyboard or moving with a suitcase, our model successfully attends to non-local body parts and leverages non-local vertices to predict the positions of the objects.
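The per-vertex attention value described above can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes a row-normalized self-attention matrix over a token sequence in which all human vertices come first and all object vertices follow. The function name `human_to_object_attention` and the parameter `num_human` are hypothetical.

```python
import numpy as np

def human_to_object_attention(attn, num_human):
    """Average attention from each human vertex to all object vertices.

    attn: (N, N) attention matrix; row i holds the weights token i assigns
          to every token. Assumed layout: human tokens first, then object tokens.
    num_human: number of human vertex tokens (hypothetical parameter).
    Returns an array of shape (num_human,).
    """
    attn = np.asarray(attn, dtype=float)
    # Rows: human queries. Columns: object keys.
    human_to_obj = attn[:num_human, num_human:]
    return human_to_obj.mean(axis=1)

# Toy example: 3 human tokens followed by 2 object tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax
scores = human_to_object_attention(attn, num_human=3)
print(scores.shape)  # (3,)
```

Human vertices with high scores are the ones the model relies on most when placing the object, which is the quantity the visualization colors.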

Reconstruction quality results on BEHAVE and InterCap datasets

Abstract

With the diversification of human-object interaction (HOI) applications and the success of capturing human meshes, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, this approach leads to a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on the BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
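To make the idea of a graph residual block concrete, here is a minimal sketch of one common form: two graph convolutions over a normalized mesh adjacency matrix, wrapped in a skip connection. The page does not specify the block's internals, so the layer count, the ReLU activation, the symmetric normalization, and all names (`normalize_adjacency`, `graph_residual_block`, `w1`, `w2`) are assumptions for illustration rather than the HOI-TG implementation.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize an adjacency matrix after adding self-loops."""
    adj = np.asarray(adj, dtype=float)
    adj = adj + np.eye(adj.shape[0])          # self-loops keep each vertex's own feature
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_residual_block(x, adj_norm, w1, w2):
    """Assumed form: x + GraphConv(ReLU(GraphConv(x))).

    x: (V, C) vertex features; adj_norm: (V, V); w1, w2: (C, C) weights.
    """
    h = np.maximum(adj_norm @ x @ w1, 0.0)    # aggregate neighbors, project, ReLU
    h = adj_norm @ h @ w2                     # second neighborhood aggregation
    return x + h                              # residual connection

# Toy mesh graph with 4 vertices and 3 feature channels.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w1 = 0.1 * rng.normal(size=(3, 3))
w2 = 0.1 * rng.normal(size=(3, 3))
adj_norm = normalize_adjacency(adj)
out = graph_residual_block(x, adj_norm, w1, w2)
print(out.shape)  # (4, 3)
```

In this sketch the graph convolutions mix features only along mesh edges, which supplies local topology, while the surrounding self-attention layers (not shown) can relate any pair of vertices globally.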

