Abstract
We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies. These effects, such as physical interactions, water splashing, light, and shaking, are grounded either to a character in the scene or to the camera. We collect a new dataset, referred to as the Movie4D dataset, which annotates over 9K effects in 63 movies. We propose a Conditional Random Field model atop a neural network that brings together visual and audio information, as well as semantics in the form of person tracks. Our model further exploits correlations of effects between different characters in the clip as well as across movie threads. We propose effect detection and classification as two tasks, and present results along with ablation studies on our dataset, paving the way towards 4D cinema in everyone's homes.