Spatio-Temporal Graph Diffusion for Text-Driven Human Motion Generation

BMVC 2023

Paper Code

Abstract


Text-based human motion generation is challenging due to the complexity and context-dependency of natural human motions. In recent years, an increasing number of studies have focused on using transformer-based diffusion models to tackle this issue. However, an over-reliance on transformers has resulted in a lack of adequate detail in the generated motions. This study proposes a novel graph network-based diffusion model to address this challenging problem. Specifically, we use spatio-temporal graphs to capture local details for each node and an auxiliary transformer to aggregate the information across all nodes. In addition, the transformer is also used to process conditional global information that is difficult to handle with graph networks. Our model achieves competitive results on currently the largest dataset HumanML3D and outperforms existing diffusion models in terms of FID and diversity, demonstrating the advantages of graph neural networks in modeling human motion data.

Approach


STGMD Overview

Sampling Process

net_1.jpg
net_2.jpg
The input to the diffusion model consists of a spatio-temporal graph x_t, the noising time step t, and the corresponding text description c. The STG-UNet extracts the fine-grained local details of human motion, while the frozen CLIP extracts text embeddings. In addition to aggregating local details, the transformer processes the text embedding c and the time step t.

Starting from random Gaussian noise, a text description c, and the diffusion step T, STGMD gradually anneals the noise into a sample x^0. At each time step t, our model predicts the sample's initial state x^0 and then diffuses it back to x^{t-1}. Repeating these operations T times yields the final sample x^0. Only one frame of the graph is shown in the figure.
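The sampling loop above can be sketched as a standard DDPM-style procedure with the x0-prediction parameterization: at each step the network predicts the clean sample, which is then diffused back one step via the Gaussian posterior q(x_{t-1} | x_t, x0). This is a minimal illustrative sketch, not the paper's implementation; `model`, `text_emb`, and all variable names here are assumptions.

```python
import numpy as np

def sample(model, text_emb, shape, betas, rng=None):
    """Anneal Gaussian noise into a motion sample.

    `model` is any callable (x_t, t, text_emb) -> predicted x0
    (in the paper, the STG-UNet plus auxiliary transformer).
    All names are illustrative placeholders.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)             # cumulative product \bar{alpha}_t
    x = rng.standard_normal(shape)        # start from pure Gaussian noise
    T = len(betas)
    for t in reversed(range(T)):
        x0_hat = model(x, t, text_emb)    # network predicts the clean sample x^0
        abar_prev = abar[t - 1] if t > 0 else 1.0
        # posterior q(x_{t-1} | x_t, x0): mean is a weighted mix of x0_hat and x_t
        coef_x0 = betas[t] * np.sqrt(abar_prev) / (1.0 - abar[t])
        coef_xt = (1.0 - abar_prev) * np.sqrt(alphas[t]) / (1.0 - abar[t])
        mean = coef_x0 * x0_hat + coef_xt * x
        if t > 0:
            var = betas[t] * (1.0 - abar_prev) / (1.0 - abar[t])
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean                      # final step is deterministic
    return x
```

In practice the motion tensor has shape (frames, joints, features) rather than the generic `shape` used here, and the text embedding comes from the frozen CLIP encoder.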


STG-UNet Overview

img_2.jpg

STG-UNet Overview. The input is a spatio-temporal graph X = (V, A), consisting of a vertex matrix V and an adjacency matrix A. The STG-Block processes the vertex matrix and transforms it into a graph representation. A pooling operation then selects the nodes retained at the coarser scale, and this process repeats to produce progressively coarser scales. On the way back up, the unpooling operation refills the previously excluded nodes with empty nodes to recover the finer scales. Skip connections link graph representations at the same scale.
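The pool/unpool pair described in the caption can be sketched as top-k node selection followed by zero-filling the dropped nodes on the way back up. This is a minimal sketch under assumed details: the norm-based node scoring and all function names are illustrative, not the paper's actual pooling criterion.

```python
import numpy as np

def pool(V, A, k):
    """Keep the k highest-scoring nodes for the coarser scale.

    V: (n, d) vertex features, A: (n, n) adjacency.
    Scoring by feature norm is an illustrative placeholder.
    Returns the coarser (V, A) and the indices of kept nodes.
    """
    scores = np.linalg.norm(V, axis=-1)
    idx = np.sort(np.argsort(scores)[-k:])        # kept nodes, original order
    return V[idx], A[np.ix_(idx, idx)], idx

def unpool(V_coarse, idx, n_nodes):
    """Refill previously excluded nodes with empty (zero) nodes."""
    V = np.zeros((n_nodes, V_coarse.shape[-1]))
    V[idx] = V_coarse
    return V
```

A skip connection at a given scale would then add the encoder-side features to the unpooled `V` before the next STG-Block, so the zero-filled nodes recover information from the matching finer scale.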

Results


Quantitative results on the HumanML3D test set

table.jpg

Visual Results

27.gif

A person is performing a rhythmic dance routine with a ribbon.

03.gif

A person is practicing their figure skating routine.

03.gif

Someone is practicing their karate kicks.

03.gif

A person is doing a series of sit-ups on a stability ball.

03.gif

Someone is jumping over a hurdle.

03.gif

A person is doing a series of chin-ups.

03.gif

Someone is practicing their tightrope walking.

03.gif

A person is performing a series of bicep curls with dumbbells.

03.gif

Someone is practicing their agility ladder drills.

03.gif

A person is doing a series of squats.

© This webpage was in part inspired by this template.