Video Captioning Using Interleaved Semantic Bidirectional Network

Supriya Kurlekar, Manasi Dixit

Abstract

Video captioning is the task of automatically generating descriptive natural language sentences for video content. It bridges computer vision and natural language processing, enabling machines to interpret visual information and communicate it in human language. Traditional models show limited contextual understanding because they cannot fully capture spatial and temporal dependencies. In this paper we propose a novel deep learning architecture, the Interleaved Semantic Bidirectional Network (ISBN), which addresses these limitations by interleaving visual and semantic embeddings within a bidirectional processing framework. The model incorporates spatial and temporal features extracted via CNNs and 3D-CNNs, enriched with semantic information such as detected objects and their actions. A Bi-LSTM jointly encodes these features, followed by dual attention mechanisms that guide caption generation. Robustness is further improved by employing Bayesian inference to model uncertainty. Experimental evaluations on the widely used MSVD and MSR-VTT datasets demonstrate that ISBN outperforms several state-of-the-art baselines on key metrics, including BLEU, METEOR, and CIDEr. The proposed model produces more context-aware and human-like captions, especially in complex video scenes involving multiple interacting entities.
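
To make the described pipeline concrete, the following PyTorch sketch illustrates one plausible reading of the abstract: visual and semantic embeddings interleaved per time step, a Bi-LSTM joint encoder, and dual attention producing separate context vectors for each stream. This is a minimal hypothetical reconstruction, not the authors' implementation; the module names, feature dimensions, frame-wise interleaving scheme, and additive attention form are all assumptions, since the abstract does not specify them.

# Hypothetical sketch of the ISBN encoder as described in the abstract.
# Assumptions (not from the paper): feature dims, simple per-frame
# interleaving of visual and semantic embeddings, additive attention.
import torch
import torch.nn as nn


class ISBNEncoder(nn.Module):
    def __init__(self, vis_dim=2048, sem_dim=300, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)  # 2D/3D-CNN features
        self.sem_proj = nn.Linear(sem_dim, hid_dim)  # object/action embeddings
        # Bi-LSTM jointly encodes the interleaved sequence.
        self.bilstm = nn.LSTM(hid_dim, hid_dim, bidirectional=True,
                              batch_first=True)
        # Dual attention: one scoring head per stream.
        self.att_vis = nn.Linear(2 * hid_dim, 1)
        self.att_sem = nn.Linear(2 * hid_dim, 1)

    def forward(self, vis_feats, sem_feats):
        # vis_feats: (B, T, vis_dim); sem_feats: (B, T, sem_dim)
        v = self.vis_proj(vis_feats)
        s = self.sem_proj(sem_feats)
        # Interleave per time step: v1, s1, v2, s2, ...
        B, T, H = v.shape
        x = torch.stack((v, s), dim=2).view(B, 2 * T, H)
        enc, _ = self.bilstm(x)  # (B, 2T, 2*hid_dim)
        vis_steps, sem_steps = enc[:, 0::2], enc[:, 1::2]
        # Separate attention-weighted context vectors for each stream.
        a_v = torch.softmax(self.att_vis(vis_steps), dim=1)
        a_s = torch.softmax(self.att_sem(sem_steps), dim=1)
        ctx_v = (a_v * vis_steps).sum(dim=1)
        ctx_s = (a_s * sem_steps).sum(dim=1)
        # Concatenated context would condition the caption decoder (not shown).
        return torch.cat((ctx_v, ctx_s), dim=-1)


if __name__ == "__main__":
    enc = ISBNEncoder()
    vis = torch.randn(2, 16, 2048)  # 16 frames of CNN features
    sem = torch.randn(2, 16, 300)   # matching semantic embeddings
    print(enc(vis, sem).shape)      # torch.Size([2, 2048])

The exact interleaving order, the form of the "dual" attention, and how Bayesian inference is layered on top may differ in the published model; this sketch only fixes the data flow the abstract names in a runnable form.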
