Abstract
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal
models, which we call Temporal Convolutional Networks
(TCNs), that use a hierarchy of temporal convolutions to
perform fine-grained action segmentation or detection. Our
Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our
Dilated TCN uses dilated convolutions. We show that TCNs
are capable of capturing action compositions, segment durations, and long-range dependencies, and are more than an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
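To make the Dilated TCN idea concrete, the following is a minimal sketch (not the authors' implementation) of a dilated temporal convolution stack in PyTorch: 1-D convolutions over the time axis whose dilation doubles at each layer, so the receptive field grows exponentially with depth while every frame still receives a class prediction. All layer sizes and names here are illustrative assumptions.

    # Minimal sketch of a dilated temporal convolution stack (assumed
    # hyperparameters; not the paper's exact architecture).
    import torch
    import torch.nn as nn

    class DilatedTCN(nn.Module):
        def __init__(self, in_channels, hidden, num_classes, num_layers=4):
            super().__init__()
            layers = []
            channels = in_channels
            for i in range(num_layers):
                dilation = 2 ** i  # dilation doubles per layer: 1, 2, 4, 8
                layers += [
                    nn.Conv1d(channels, hidden, kernel_size=3,
                              padding=dilation, dilation=dilation),
                    nn.ReLU(),
                ]
                channels = hidden
            self.backbone = nn.Sequential(*layers)
            # 1x1 convolution acts as a per-frame action classifier.
            self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

        def forward(self, x):
            # x: (batch, in_channels, time) -> (batch, num_classes, time)
            return self.classifier(self.backbone(x))

    # Example: label each of 100 frames with one of 10 action classes.
    model = DilatedTCN(in_channels=128, hidden=64, num_classes=10)
    frames = torch.randn(2, 128, 100)  # 2 clips of 100 frame features
    logits = model(frames)             # shape: (2, 10, 100)

Because padding equals the dilation for kernel size 3, each layer preserves the sequence length, so the network maps a sequence of per-frame features to an equal-length sequence of per-frame action scores, which is what temporal segmentation requires.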