A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

2019-12-10

Abstract

The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have focused either on a top-down approach of generating language through combinations of object detections and language models, or on bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest-neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas from the bottom-up and top-down approaches to image description and propose a method for video description that captures the most relevant contents of a video in a natural language description. We propose a hybrid system consisting of a low-level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors, and a high-level module to produce final lingual descriptions. We compare the results of our system to human descriptions in both short and long forms on two datasets, and demonstrate that the final system output has greater agreement with the human descriptions than the output of any single level.
