Abstract. State-of-the-art video restoration methods integrate optical
flow estimation networks to utilize temporal information. However, these
networks typically consider only a pair of consecutive frames and hence
cannot capture long-range temporal dependencies or establish
correspondences across several timesteps. To alleviate these problems, we
propose a novel Spatio-temporal Transformer
Network (STTN) which handles multiple frames at once and thereby
manages to mitigate the common nuisance of occlusions in optical flow
estimation. Our proposed STTN comprises a module that estimates optical
flow in both space and time and a resampling layer that selectively
warps target frames using the estimated flow. In our experiments, we
demonstrate the efficiency of the proposed network and show
state-of-the-art restoration results in video super-resolution and video deblurring.
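
As an illustration of the resampling step described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-channel frames stacked into a (T, H, W) volume and a flow field that stores, for each output pixel, a spatial offset (dx, dy) and an absolute temporal coordinate t, and it assumes trilinear interpolation as the sampling scheme. The function name st_resample and the flow layout are hypothetical choices made for this example.

```python
import numpy as np

def st_resample(frames, flow):
    """Sample a (T, H, W) frame stack at spatio-temporal locations.

    frames: float array of shape (T, H, W).
    flow:   float array of shape (H, W, 3) holding (dx, dy, t) per output
            pixel (assumed layout: spatial offsets plus absolute time index).
    Returns an (H, W) image obtained by trilinear interpolation.
    """
    T, H, W = frames.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Absolute sampling coordinates, clamped to the volume bounds.
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    t = np.clip(flow[..., 2], 0, T - 1)

    # Integer corners of the enclosing cell and fractional weights.
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    t0 = np.floor(t).astype(int); t1 = np.minimum(t0 + 1, T - 1)
    wx, wy, wt = x - x0, y - y0, t - t0

    # Accumulate the eight weighted corner values.
    out = np.zeros((H, W), dtype=float)
    for ti, a in ((t0, 1 - wt), (t1, wt)):
        for yi, b in ((y0, 1 - wy), (y1, wy)):
            for xi, c in ((x0, 1 - wx), (x1, wx)):
                out += a * b * c * frames[ti, yi, xi]
    return out
```

Because each output pixel chooses its own temporal coordinate, a pixel that is occluded in one frame can be sampled from another frame in which it is visible, which is the intuition behind handling multiple frames at once.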