Spatio-Temporal Graph Diffusion for Text-Driven Human Motion Generation

BMVC 2023

Paper Code

Abstract


Text-based human motion generation is challenging due to the complexity and context-dependency of natural human motions. In recent years, an increasing number of studies have focused on using transformer-based diffusion models to tackle this issue. However, an over-reliance on transformers has resulted in a lack of adequate detail in the generated motions. This study proposes a novel graph network-based diffusion model to address this challenging problem. Specifically, we use spatio-temporal graphs to capture local details for each node and an auxiliary transformer to aggregate the information across all nodes. In addition, the transformer is also used to process conditional global information that is difficult to handle with graph networks. Our model achieves competitive results on currently the largest dataset HumanML3D and outperforms existing diffusion models in terms of FID and diversity, demonstrating the advantages of graph neural networks in modeling human motion data.

Approach


STGMD Overview

Sampling Process

net_1.jpg
net_2.jpg
The input to the diffusion model consists of a spatio-temporal graph x_t, the noising time step t, and the corresponding text description c. The STG-UNet extracts the fine-grained local details of human motion, while the frozen CLIP extracts text embeddings. In addition to aggregating local details, the transformer processes the text embedding c and the time step t.

Starting from random Gaussian noise, a text description c, and the diffusion step T, STGMD gradually anneals the noise into a sample x^0. At each time step t, our model predicts the sample's initial state x^0 and then diffuses it back to x^{t-1}. Repeating these operations T times yields the final sample x^0. Only one frame of the graph is shown in the figure.
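The sampling loop above can be sketched as a standard DDPM-style procedure with the x0-prediction parameterization: at each step the network predicts the clean sample, which is then diffused back one step via the Gaussian posterior q(x_{t-1} | x_t, x0). This is a minimal illustrative sketch, not the paper's implementation; `model`, `text_emb`, and all variable names here are assumptions.

```python
import numpy as np

def sample(model, text_emb, shape, betas, rng=None):
    """Anneal Gaussian noise into a motion sample.

    `model` is any callable (x_t, t, text_emb) -> predicted x0
    (in the paper, the STG-UNet plus auxiliary transformer).
    All names are illustrative placeholders.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)             # cumulative product \bar{alpha}_t
    x = rng.standard_normal(shape)        # start from pure Gaussian noise
    T = len(betas)
    for t in reversed(range(T)):
        x0_hat = model(x, t, text_emb)    # network predicts the clean sample x^0
        abar_prev = abar[t - 1] if t > 0 else 1.0
        # posterior q(x_{t-1} | x_t, x0): mean is a weighted mix of x0_hat and x_t
        coef_x0 = betas[t] * np.sqrt(abar_prev) / (1.0 - abar[t])
        coef_xt = (1.0 - abar_prev) * np.sqrt(alphas[t]) / (1.0 - abar[t])
        mean = coef_x0 * x0_hat + coef_xt * x
        if t > 0:
            var = betas[t] * (1.0 - abar_prev) / (1.0 - abar[t])
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean                      # final step is deterministic
    return x
```

In practice the motion tensor has shape (frames, joints, features) rather than the generic `shape` used here, and the text embedding comes from the frozen CLIP encoder.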


STG-UNet Overview

img_2.jpg

STG-UNet Overview. The input is a spatio-temporal graph X = (V, A), consisting of a vertex matrix V and an adjacency matrix A. The STG-Block processes the vertex matrix and transforms it into a graph representation. A pooling operation then selects the nodes retained at the coarser scale, and this process repeats to produce progressively coarser scales. On the way back up, the unpooling operation refills the previously excluded nodes with empty nodes to recover the finer scales. Skip connections link graph representations at the same scale.
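The pool/unpool pair described in the caption can be sketched as top-k node selection followed by zero-filling the dropped nodes on the way back up. This is a minimal sketch under assumed details: the norm-based node scoring and all function names are illustrative, not the paper's actual pooling criterion.

```python
import numpy as np

def pool(V, A, k):
    """Keep the k highest-scoring nodes for the coarser scale.

    V: (n, d) vertex features, A: (n, n) adjacency.
    Scoring by feature norm is an illustrative placeholder.
    Returns the coarser (V, A) and the indices of kept nodes.
    """
    scores = np.linalg.norm(V, axis=-1)
    idx = np.sort(np.argsort(scores)[-k:])        # kept nodes, original order
    return V[idx], A[np.ix_(idx, idx)], idx

def unpool(V_coarse, idx, n_nodes):
    """Refill previously excluded nodes with empty (zero) nodes."""
    V = np.zeros((n_nodes, V_coarse.shape[-1]))
    V[idx] = V_coarse
    return V
```

A skip connection at a given scale would then add the encoder-side features to the unpooled `V` before the next STG-Block, so the zero-filled nodes recover information from the matching finer scale.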

Results


Quantitative results on the HumanML3D test set

table.jpg

Visual Results

27.gif

A person is performing a rhythmic dance routine with a ribbon.

03.gif

A person is practicing their figure skating routine.

03.gif

Someone is practicing their karate kicks.

03.gif

A person is doing a series of sit-ups on a stability ball.

03.gif

Someone is jumping over a hurdle.

03.gif

A person is doing a series of chin-ups.

03.gif

Someone is practicing their tightrope walking.

03.gif

A person is performing a series of bicep curls with dumbbells.

03.gif

Someone is practicing their agility ladder drills.

03.gif

A person is doing a series of squats.

© This webpage was in part inspired by this template.