Abstract
Video sequences contain rich dynamic patterns, such as
dynamic texture patterns that exhibit stationarity in the temporal domain, and action patterns that are non-stationary in either the spatial or the temporal domain. We show that a spatial-temporal generative ConvNet can be used to model and synthesize dynamic patterns. The model defines a probability
distribution on the video sequence, and the log probability
is defined by a spatial-temporal ConvNet that consists of
multiple layers of spatial-temporal filters to capture spatial-temporal patterns of different scales. The model can be
learned from the training video sequences by an “analysis
by synthesis” learning algorithm that iterates the following two steps. Step 1 synthesizes video sequences from the
currently learned model. Step 2 then updates the model parameters based on the difference between the synthesized
video sequences and the observed training sequences. We
show that the learning algorithm can synthesize realistic dynamic patterns, such as dynamic textures and action patterns.
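In symbols, one standard way to realize such a model (a sketch of the usual energy-based formulation; the symbols $f$, $q$, and $Z$ are our notational assumptions, not taken verbatim from the paper) is

$$ p(\mathbf{I}; w) \;=\; \frac{1}{Z(w)} \exp\!\big[f(\mathbf{I}; w)\big]\, q(\mathbf{I}), $$

where $\mathbf{I}$ is the video sequence, $f(\mathbf{I}; w)$ is the score computed by the spatial-temporal ConvNet with filter weights $w$, $q$ is a reference distribution such as Gaussian white noise, and $Z(w)$ is the intractable normalizing constant. The log probability is then $f(\mathbf{I}; w) + \log q(\mathbf{I}) - \log Z(w)$, and the gradient of the log-likelihood takes the contrastive form $\frac{1}{n}\sum_i \partial_w f(\mathbf{I}^{\mathrm{obs}}_i; w) - \mathrm{E}_{p}\big[\partial_w f(\mathbf{I}; w)\big]$, which is why Step 2 compares synthesized sequences against observed ones.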
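A minimal sketch of the two-step "analysis by synthesis" loop in Python/PyTorch, assuming a model `net` that maps a batch of videos to per-video scores; the Langevin sampler, step sizes, and iteration counts below are illustrative assumptions rather than the paper's settings:

```python
import torch

def langevin_step(net, videos, step_size=0.01, sigma2=1.0):
    # Step 1 (synthesis): one Langevin update moving the synthesized
    # videos toward high-score regions of the current model, for
    # log p(I; w) proportional to f(I; w) - ||I||^2 / (2 * sigma2).
    videos = videos.detach().requires_grad_(True)
    score = net(videos).sum()                      # sum of f(I; w) over the batch
    grad = torch.autograd.grad(score, videos)[0]   # df/dI
    noise = torch.randn_like(videos)
    return (videos
            + 0.5 * step_size ** 2 * (grad - videos / sigma2)
            + step_size * noise).detach()

def learn(net, observed, num_iters=100, langevin_steps=20, lr=1e-4):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    synthesized = torch.randn_like(observed)       # start from the reference q
    for _ in range(num_iters):
        # Step 1: synthesize video sequences from the current model.
        for _ in range(langevin_steps):
            synthesized = langevin_step(net, synthesized)
        # Step 2: update parameters from the difference between the
        # synthesized and observed statistics (contrastive gradient).
        opt.zero_grad()
        loss = net(synthesized).mean() - net(observed).mean()
        loss.backward()
        opt.step()
    return synthesized
```

Minimizing `loss` ascends the log-likelihood: it raises the score of the observed sequences and lowers that of the synthesized ones, so the two sets of statistics match at convergence.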