Affiliation of Author(s):College of Computer Science and Technology / College of Artificial Intelligence / College of Software
Journal:Lect. Notes Comput. Sci.
Abstract:Describing videos in human language is of vital importance in many applications, such as managing massive online videos and providing descriptive video service (DVS) for blind people. To further improve on existing video description frameworks, this paper presents an end-to-end deep learning model that combines Convolutional Neural Networks (CNNs) and Bidirectional Recurrent Neural Networks (BiRNNs) with a multimodal attention mechanism. First, the model produces richer video representations than similar prior work, including image, motion, and audio features. Second, the BiRNN encoder processes these features in both forward and backward directions. Finally, an attention-based decoder translates the sequential outputs of the encoder into a sequence of words. The model is evaluated on the Microsoft Research Video Description Corpus (MSVD) dataset. The results demonstrate the necessity of combining BiRNNs with a multimodal attention mechanism and the superiority of this model over other state-of-the-art methods evaluated on this dataset. © Springer Nature Switzerland AG 2018.
ISSN No.:0302-9743
Translation or Not:no
Date of Publication:2018-01-01
Co-author:Du, Xiaotong,lz
Corresponding Author:Du, Xiaotong; Yuan, Jiabing
Professor
Doctoral Supervisor
Main positions:University Library Director
Alma Mater:Nanjing University of Aeronautics and Astronautics
Education Level:Nanjing University of Aeronautics and Astronautics
Degree:Doctoral Degree in Engineering
School/Department:College of Computer Science and Technology
Business Address:Room 318, Building of the College of Computer Science and Technology, Jiangjun Road Campus, Nanjing University of Aeronautics and Astronautics
Contact Information:Email: jbyuan@nuaa.edu.cn; Phone: 13805165286