We define the attention value of a human vertex towards the object as the average of its attention values over all object vertices.
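The averaging step can be illustrated with a minimal sketch. It assumes the attention is available as an N x M matrix whose entry (i, j) is the attention from human vertex i to object vertex j. The function name `per_vertex_attention`, the matrix layout, and the toy values are hypothetical and chosen only for illustration; they are not taken from our implementation.

```python
from statistics import fmean


def per_vertex_attention(attn):
    """Average each row of an N x M attention matrix.

    attn[i][j] is assumed to hold the attention from human vertex i
    to object vertex j (hypothetical layout). The result has one
    value per human vertex: its mean attention over all object vertices.
    """
    if not attn or any(len(row) == 0 for row in attn):
        raise ValueError("expected a non-empty N x M attention matrix")
    return [fmean(row) for row in attn]


# Toy example: 2 human vertices, 3 object vertices (made-up values).
attn = [
    [0.0, 0.0, 0.0],  # a vertex that does not attend to the object
    [0.5, 1.0, 0.0],  # a vertex that attends to part of the object
]
scores = per_vertex_attention(attn)
```

Each entry of `scores` is a single scalar per human vertex, which is the quantity that can be mapped to a color on the body surface for visualization.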
For simple, static interactions, the position of the object relates only to local body parts. For instance, the positions of a chair, table, monitor, and basketball depend solely on the locations of the interacting body parts.
In contrast, for complex interactions, such as sitting on a chair while using a keyboard or moving with a suitcase, our model attends to non-local body parts and leverages their vertices to predict the positions of the objects.