Abstract
We present a model that uses a single first-person image to generate an egocentric basketball motion sequence
in the form of a 12D camera configuration trajectory, which
encodes a player’s 3D location and 3D head orientation
throughout the sequence. To do this, we first introduce a
future convolutional neural network (CNN) that predicts an
initial sequence of 12D camera configurations, aiming to
capture how real players move during a one-on-one basketball game. We also introduce a goal verifier network, which
is trained to verify that a given camera configuration is consistent with the final goals of real one-on-one basketball
players. Next, we propose an inverse synthesis procedure
to synthesize a refined sequence of 12D camera configurations that (1) sufficiently matches the initial configurations
predicted by the future CNN and (2) maximizes the output of the goal verifier network. Finally, by following the
trajectory resulting from the refined camera configuration
sequence, we obtain the complete 12D motion sequence.
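The refinement step above can be summarized as a trade-off between fidelity to the future CNN's predictions and the goal verifier's score. One plausible formalization is sketched below; the notation ($V$, $\tilde{\mathbf{c}}_t$, $\lambda$) and the additive form are our assumptions, not necessarily the paper's exact objective:

```latex
\mathbf{c}_{1:T}^{*} \;=\; \arg\max_{\mathbf{c}_{1:T}}
\;\sum_{t=1}^{T} V(\mathbf{c}_t)
\;-\; \lambda \sum_{t=1}^{T} \bigl\lVert \mathbf{c}_t - \tilde{\mathbf{c}}_t \bigr\rVert_2^2
```

Here $\mathbf{c}_t \in \mathbb{R}^{12}$ is a camera configuration at time $t$, $\tilde{\mathbf{c}}_t$ is the initial configuration predicted by the future CNN, $V(\cdot)$ is the goal verifier network's output, and $\lambda$ balances staying close to the initial predictions against maximizing the verifier's score.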
Our model generates realistic basketball motion sequences that capture the goals of real players, outperforming standard deep learning approaches such as recurrent
neural networks (RNNs), long short-term memory networks
(LSTMs), and generative adversarial networks (GANs).