47 articles found
MFSR: Maximum Feature Score Region-based Captions Locating in News Video Images
1
Authors: Zhi-Heng Wang, Chao Guo, Hong-Min Liu, Zhan-Qiang Huo. International Journal of Automation and Computing (EI, CSCD), 2018, Issue 4, pp. 454-461 (8 pages)
For news video images, caption recognition is a useful and important step in content understanding. Caption locating is usually the first step of caption recognition, and this paper proposes a simple but effective caption locating algorithm, the maximum feature score region (MFSR) based method, which consists of two stages. In the first stage, the upper and lower boundaries are obtained by projecting the edge map. In the second stage, the maximum feature score region is defined and the left and right boundaries are obtained using MFSR. Experiments show that the proposed MFSR-based method has superior and robust performance on news video images of different types.
Keywords: News video images; captions recognizing; captions locating; content understanding; maximum feature score region (MFSR)
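The first stage described in the abstract above (finding the upper and lower caption boundaries from an edge map projection) can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the binary edge-map input, and the `ratio` threshold are all assumptions made for illustration, and the abstract does not say how the projection profile is thresholded.

```python
from typing import List, Tuple


def row_projection(edge_map: List[List[int]]) -> List[int]:
    """Sum the edge pixels in each row (horizontal projection profile)."""
    return [sum(row) for row in edge_map]


def vertical_bounds(edge_map: List[List[int]], ratio: float = 0.5) -> Tuple[int, int]:
    """Return inclusive (top, bottom) row indices of the longest dense band.

    Hypothetical rule: a row counts as a caption row when its projection is
    at least `ratio` times the maximum projection value.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    profile = row_projection(edge_map)
    peak = max(profile)
    if peak == 0:
        raise ValueError("edge map contains no edge pixels")
    threshold = ratio * peak

    best = (0, -1)  # empty run; length is best[1] - best[0] + 1 == 0
    start = None
    # A trailing 0 acts as a sentinel so a run reaching the last row is closed.
    for i, value in enumerate(profile + [0]):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            if i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best
```

For a 5-row toy edge map whose edge pixels concentrate in rows 2 and 3, `vertical_bounds` returns those two rows as the band. The left and right boundaries would need a separate step, which the abstract attributes to the MFSR score and which is not sketched here.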
The Effect of TV Captions on the Comprehension of Non-Native Saudi Learners of English
2
Authors: Mubarak Alkhatnai. Sino-US English Teaching, 2012, Issue 10, pp. 1573-1579 (7 pages)
This paper investigates the effectiveness of closed captioning in aiding Saudi students who are learning ESL (English as a second language). The research was qualitative, and the participants were 12 Saudi students pursuing their studies at Indiana University of Pennsylvania, USA (IUP). Participants were asked to compose a narrative after viewing a 5-minute film segment, both with and without captioning. Their responses were then analyzed, and the results indicated that while captions may aid comprehension, they also tend to limit one's interpretations, reaffirming the nature of written language as an authoritative source of information.
Keywords: TV (television) captions; comprehension; ESL (English as a second language); written text; language classroom
TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
3
Authors: Yangliu HU, Zikai SONG, Junqing YU, Yiping Phoebe CHEN, Wei YANG. Frontiers of Information Technology & Electronic Engineering, 2025, Issue 11, pp. 2204-2214 (11 pages)
Video large language models (video-LLMs) have demonstrated impressive capabilities in multimodal understanding, but their potential as zero-shot evaluators for temporal consistency in video captions remains underexplored. Existing methods notably underperform in detecting critical temporal errors, such as missing, hallucinated, or misordered actions. To address this gap, we introduce two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs. It incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark featuring videos across four distinct complexity levels, specifically designed with fine-grained temporal error annotations to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we demonstrate that TimeJudge consistently yields substantial gains in terms of recall and F1-score without requiring any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs.
Keywords: Video large language model (Video-LLM); Multimodal large language model (MLLM); MLLM-as-a-Judge; Video caption; Benchmark
THE EFFECTS OF CAPTIONS ON CHINESE EFL STUDENTS' INCIDENTAL VOCABULARY ACQUISITION (cited by: 2)
4
Authors: 汪徽 (Wang Hui). Chinese Journal of Applied Linguistics, 2007, Issue 4, pp. 9-16, 128 (9 pages)
This research investigated the effects of captions on Chinese EFL students' incidental vocabulary acquisition. The results are: 1) Captions contribute substantially to students' incidental vocabulary acquisition. In particular, English captions did more to enhance students' mastery of word spelling and listening word recognition, while Chinese captions did more to improve their mastery of word meaning. 2) The effects of both L1 and L2 captions on vocabulary acquisition are conspicuous for students of both high and low L2 proficiency. In terms of word meaning, however, students with low L2 proficiency benefit far more from the L1 captions than from the L2 subtitles, while the difference between the effects of L1 and L2 captions is smaller for students of high L2 proficiency. The study suggests that captioned authentic video materials could be a good way to teach L2 vocabulary, but teachers have to be careful in deciding when and how to use the captions.
Keywords: captions; incidental vocabulary acquisition; L1; L2
Deconfounded fashion image captioning with transformer and multimodal retrieval
5
Authors: Tao PENG, Weiqiao YIN, Junping LIU, Li LI, Xinrong HU. Virtual Reality & Intelligent Hardware (《虚拟现实与智能硬件(中英文)》), 2025, Issue 2, pp. 127-138 (12 pages)
Background: The annotation of fashion images is an important task in the fashion industry as well as in social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task entails multiple challenges, including the lack of fine-grained captions and confounders caused by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capabilities. Method: In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing, and then constructs a detailed causal graph using causal inference in the decoder to perform deconfounding. Multimodal retrieval is used to obtain semantic words related to image features, which are input into the decoder as prompt words to enrich sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while concurrently eliminating visual and language confounding. Results: Overall, our method can not only effectively enrich the captions of target images, but also greatly reduce confounders caused by the dataset. To verify the effectiveness of the proposed framework, the model was experimentally verified using the FACAD dataset.
Keywords: Image caption; Causal inference; Fashion caption
UniTrans: Unified Parameter-Efficient Transfer Learning and Multimodal Alignment for Large Multimodal Foundation Model
6
Authors: Jiakang Sun, Ke Chen, Xinyang He, Xu Liu, Ke Li, Cheng Peng. Computers, Materials & Continua, 2025, Issue 4, pp. 219-238 (20 pages)
With the advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal Consistency Alignment Module and the Query-Augmentation Side Network, specifically optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks demonstrate that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters. Additionally, it achieves superior performance compared to fully fine-tuned models on certain benchmarks.
Keywords: Parameter-efficient transfer learning; multimodal alignment; image captioning; image-text retrieval; visual question answering
Convolutional BiLSTM Variational Sequence-To-Sequence Based Video Captioning for Capturing Intricate Temporal Dependencies
7
Authors: M. Gowri Shankar, D. Surendran. Journal of Bionic Engineering, 2025, Issue 5, pp. 2700-2716 (17 pages)
In the realm of video understanding, the demand for accurate and contextually rich video captioning has surged with the increasing volume and complexity of multimedia content. This research introduces a solution for video captioning based on a Convolutional Bidirectional Long Short-Term Memory (BiLSTM) Variational Sequence-to-Sequence (CBVSS) approach. The proposed framework is adept at capturing intricate temporal dependencies within video sequences, enabling a more nuanced and contextually relevant description of dynamic scenes. However, optimizing its parameters for improved performance remains a crucial challenge. In response, Golden Eagle Optimization (GEO), a metaheuristic optimization technique, is used to fine-tune the parameters of the Convolutional BiLSTM variational sequence-to-sequence model. The application of GEO aims to enhance the ability of CBVSS to produce more exact and contextually rich video captions. The proposed method attains an overall higher Recall of 59.75% and Precision of 63.78% for both datasets. Additionally, the proposed CBVSS method demonstrated superior performance across both datasets, achieving the highest METEOR (25.67) and CIDEr (39.87) scores on the ActivityNet dataset, and further outperforming all compared models on the YouCook2 dataset with METEOR (28.67) and CIDEr (43.02), highlighting its effectiveness in generating semantically rich and contextually accurate video captions.
Keywords: Video captioning; Convolutional BiLSTM; Variational sequence-to-sequence model; Golden eagle optimization; Intricate temporal dependencies
A Survey on Enhancing Image Captioning with Advanced Strategies and Techniques
8
Authors: Alaa Thobhani, Beiji Zou, Xiaoyan Kui, Amr Abdussalam, Muhammad Asim, Sajid Shah, Mohammed ELAffendi. Computer Modeling in Engineering & Sciences, 2025, Issue 3, pp. 2247-2280 (34 pages)
Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful, syntactically accurate sentences that describe the visual content depicted in photographs. Many real-world applications rely on image captioning, such as helping people with visual impairments perceive their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are used to comprehend the visual content within an image, followed by natural language processing methods. Numerous approaches and models have been developed to deal with this multifaceted problem, and several have proved to be state-of-the-art solutions in this field. This work offers a perspective that emphasizes the most critical strategies and techniques for enhancing image caption generation. Rather than reviewing all previous image captioning work, we analyze techniques that significantly improve image caption generation, including visual attention methods, the types of semantic information used in captions, and multi-caption generation techniques. Further, advancements such as neural architecture search, few-shot learning, multi-phase learning, and cross-modal embedding within image caption networks are examined for their effects. The quantitative analysis conducted in this study identifies cutting-edge methodologies and sheds light on their impact on image captioning technology.
Keywords: Image captioning; semantic attention; multi-caption; natural language processing; visual attention methods
LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning
9
Authors: Yuting He, Zetao Jiang. Computers, Materials & Continua, 2025, Issue 9, pp. 5487-5508 (22 pages)
Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies, which effectively extracts and leverages the global correlation of image features. However, these models still face challenges in effectively capturing local associations. Moreover, since the encoder extracts global and local association features that focus on different semantic information, semantic noise may occur during the decoding stage. To address these issues, we propose the Local Relationship Enhanced Gated Transformer (LREGT). In the encoder part, we introduce the Local Relationship Enhanced Encoder (LREE), whose core component is the Local Relationship Enhanced Module (LREM). LREM consists of two novel designs: the Local Correlation Perception Module (LCPM) and the Local-Global Fusion Module (LGFM), which are beneficial for generating a comprehensive feature representation that integrates both global and local information. In the decoder part, we propose the Dual-level Multi-branch Gated Decoder (DMGD). It first creates multiple decoding branches to generate multi-perspective contextual feature representations. Subsequently, it employs the Dual-Level Gating Mechanism (DLGM) to model the multi-level relationships of these multi-perspective contextual features, enhancing their fine-grained semantics and intrinsic relationship representations. This ultimately leads to the generation of high-quality and semantically rich image captions. Experiments on the standard MSCOCO dataset demonstrate that LREGT achieves state-of-the-art performance, with a CIDEr score of 140.8 and a BLEU-4 score of 41.3, significantly outperforming existing mainstream methods. These results highlight LREGT's superiority in capturing complex visual relationships and resolving semantic noise during decoding.
Keywords: Image captioning; local relation enhancement; local correlation perception; dual-level gating mechanism
Visuals to Text: A Comprehensive Review on Automatic Image Captioning (cited by: 5)
10
Authors: Yue Ming, Nannan Hu, Chunxiao Fan, Fan Feng, Jiangwan Zhou, Hui Yu. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2022, Issue 8, pp. 1339-1365 (27 pages)
Image captioning refers to the automatic generation of descriptive text according to the visual content of images. It is a technique integrating multiple disciplines, including computer vision (CV), natural language processing (NLP), and artificial intelligence. In recent years, substantial research efforts have been devoted to image caption generation, with impressive progress. To summarize the recent advances, we present a comprehensive review of image captioning, covering both traditional methods and recent deep learning-based techniques. Specifically, we first briefly review the early traditional works based on retrieval and templates. We then focus on deep learning-based image captioning research, which is categorized into the encoder-decoder framework, attention mechanisms, and training strategies on the basis of model structures and training manners for a detailed introduction. After that, we summarize the publicly available datasets, evaluation metrics, and those proposed for specific requirements, and then compare the state-of-the-art methods on the MS COCO dataset. Finally, we discuss open challenges and future research directions.
Keywords: Artificial intelligence; attention mechanism; encoder-decoder framework; image captioning; multi-modal understanding; training strategies
VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning (cited by: 3)
11
Authors: WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng, LU Lina. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2023, Issue 1, pp. 9-18 (10 pages)
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English content. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a pre-trained Chinese language model. Experiments with various baselines are carried out to validate VLCA on the proposed dataset. The results demonstrate that the captions produced by the proposed algorithm are more descriptive and informative than those of existing algorithms.
Keywords: remote sensing image captioning (RSIC); vision-language representation; remote sensing image caption dataset; attention mechanism
A Position-Aware Transformer for Image Captioning (cited by: 3)
12
Authors: Zelin Deng, Bo Zhou, Pei He, Jianfeng Huang, Osama Alfarraj, Amr Tolba. Computers, Materials & Continua (SCIE, EI), 2022, Issue 1, pp. 2065-2081 (17 pages)
Image captioning aims to generate a corresponding description of an image. In recent years, neural encoder-decoder models have been the dominant approaches, in which a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features, ignoring the effects of other image features and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms to address these problems. The image-feature attention first extracts multi-level features using a Feature Pyramid Network (FPN), then uses scaled dot-product attention to fuse these features, which enables our model to detect objects of different scales in the image more effectively without increasing parameters. In the position-aware attention mechanism, the relative positions between image features are first obtained and then incorporated into the original image features to generate captions more accurately. Experiments are carried out on the MSCOCO dataset, and our approach achieves competitive BLEU-4, METEOR, ROUGE-L, and CIDEr scores compared with some state-of-the-art approaches, demonstrating the effectiveness of our approach.
Keywords: Deep learning; image captioning; transformer; attention; position-aware
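The abstract above mentions scaled dot-product attention as the fusion operation. As a reminder of what that operation computes, here is a minimal pure-Python sketch of the standard formula softmax(QK^T / sqrt(d_k)) V. It shows only the generic operation; how the paper applies it to FPN features is not described in the abstract, and the function and type names here are illustrative assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V with plain Python lists.

    q has shape (n_queries, d_k); k has shape (n_keys, d_k);
    v has shape (n_keys, d_v). The result has shape (n_queries, d_v).
    """
    d_k = len(k[0])
    scale = math.sqrt(d_k)
    d_v = len(v[0])
    output = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        # Weighted average of the value rows.
        output.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(d_v)]
        )
    return output
```

When two keys are identical, the query weights them equally, so the output is the plain average of the two value rows; this makes a convenient sanity check for the implementation.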
Global-Attention-Based Neural Networks for Vision Language Intelligence (cited by: 3)
13
Authors: Pei Liu, Yingjie Zhou, Dezhong Peng, Dapeng Wu. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2021, Issue 7, pp. 1243-1252 (10 pages)
In this paper, we develop a novel global-attention-based neural network (GANN) for vision language intelligence, specifically image captioning (language description of a given image). As in many previous works, the encoder-decoder framework is adopted in our proposed model. The encoder is responsible for encoding the region proposal features and extracting a global caption feature, based on a specially designed module for predicting the caption objects. The decoder generates captions by taking the obtained global caption feature, along with the encoded visual features, as inputs to each attention head of the decoder layer. The global caption feature is introduced to explore the latent contributions of region proposals to image captioning, and to help the decoder focus on the most relevant proposals so as to extract more accurate visual features at each time step of caption generation. Our GANN is implemented by incorporating the global caption feature into the attention weight calculation phase of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyzed the proposed model and quantitatively evaluated several state-of-the-art schemes with GANN on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
Keywords: Global attention; image captioning; latent contribution
Natural Language Processing with Optimal Deep Learning-Enabled Intelligent Image Captioning System (cited by: 1)
14
Authors: Radwa Marzouk, Eatedal Alabdulkreem, Mohamed K. Nour, Mesfer Al Duhayyim, Mahmoud Othman, Abu Sarwar Zamani, Ishfaq Yaseen, Abdelwahed Motwakel. Computers, Materials & Continua (SCIE, EI), 2023, Issue 2, pp. 4435-4451 (17 pages)
Recent developments in Multimedia Internet of Things (MIoT) devices empowered with Natural Language Processing (NLP) models point to a promising future for smart devices. NLP plays an important role in industrial models such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, their actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for an image. In the next step, the image should be provided with a significant and detailed description that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for an input image. To attain this, the proposed NLPODL-IICS follows two stages, encoding and decoding. At the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with the Neural Architecture Search Network (NASNet) model. This model represents the input data appropriately by mapping it into a vector of predefined length. During the decoding phase, the Chimp Optimization Algorithm (COA) with a deeper Long Short-Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps in accomplishing proper parameter tuning for the NASNet and LSTM models, respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A wide-ranging comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models.
Keywords: Natural language processing; information retrieval; image captioning; deep learning; metaheuristics
Improved image captioning with subword units training and transformer (cited by: 1)
15
Authors: Cai Qiang, Li Jing, Li Haisheng, Zuo Min. High Technology Letters (EI, CAS), 2020, Issue 2, pp. 211-216 (6 pages)
Image captioning models typically operate with a fixed vocabulary, but captioning is an open-vocabulary problem. Existing work handles out-of-vocabulary words by labeling them as unknown in a dictionary. In addition, the recurrent neural network (RNN) and its variants used in the captioning task have become a bottleneck for generation quality and training time cost. To address these two problems, a simpler but more effective approach is proposed for generating open-vocabulary captions, in which the long short-term memory (LSTM) unit is replaced with a transformer as the decoder for better caption quality and less training time. The effectiveness of different word segmentation vocabularies and the generation improvement of the transformer over LSTM are discussed, and it is shown that the improved models achieve state-of-the-art performance on the MSCOCO2014 image captioning tasks over a back-off dictionary baseline model.
Keywords: image captioning; transformer; byte pair encoding (BPE); reinforcement learning
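The entry above lists byte pair encoding (BPE) among its keywords as the subword technique. For readers unfamiliar with it, the following is a minimal sketch of how BPE merge rules are typically learned from word frequencies. It is a generic textbook-style illustration, not the authors' implementation; `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions chosen for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from a word-frequency table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab: Dict[Tuple[str, ...], int] = {}
        for symbols, freq in vocab.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

Because rare or unseen words can still be spelled out from learned subword units, a vocabulary built this way avoids mapping them to a single unknown token, which is the open-vocabulary motivation the abstract describes.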
Efficient Image Captioning Based on Vision Transformer Models
16
Authors: Samar Elbedwehy, T. Medhat, Taher Hamza, Mohammed F. Alrahmawy. Computers, Materials & Continua (SCIE, EI), 2022, Issue 10, pp. 1483-1500 (18 pages)
Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process, as it involves two sub-models: a vision sub-model for extracting object features and a language sub-model that uses the extracted features to generate meaningful captions. Attention-based vision transformer models have recently had a great impact in the vision field. In this paper, we study the effect of vision transformers on the image captioning process by evaluating four different vision transformer models as the vision sub-model. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCIT (Cross-Covariance Image Transformer), which changes the self-attention operation by focusing on the feature dimension instead of the token dimension. The last is SWIN (Shifted Windows), a vision transformer which, unlike the other transformers, uses shifted windows in splitting the image. For a deeper evaluation, the four vision transformers were tested with different versions and configurations: DINO with five different backbones, PVT in two versions (PVT_v1 and PVT_v2), one XCIT model, and the SWIN transformer. The results show the high effectiveness of the SWIN transformer within the proposed image captioning model compared with the other models.
Keywords: Image captioning; sequence-to-sequence; self-distillation; transformer; convolutional layer
Image Captioning Using Multimodal Deep Learning Approach
17
作者 Rihem Farkh Ghislain Oudinet Yasser Foued 《Computers, Materials & Continua》 SCIE EI 2024年第12期3951-3968,共18页
The process of generating descriptive captions for images has witnessed significant advancements in last years,owing to the progress in deep learning techniques.Despite significant advancements,the task of thoroughly ... The process of generating descriptive captions for images has witnessed significant advancements in last years,owing to the progress in deep learning techniques.Despite significant advancements,the task of thoroughly grasping image content and producing coherent,contextually relevant captions continues to pose a substantial challenge.In this paper,we introduce a novel multimodal method for image captioning by integrating three powerful deep learning architectures:YOLOv8(You Only Look Once)for robust object detection,EfficientNetB7 for efficient feature extraction,and Transformers for effective sequence modeling.Our proposed model combines the strengths of YOLOv8 in detecting objects,the superior feature representation capabilities of EfficientNetB7,and the contextual understanding and sequential generation abilities of Transformers.We conduct extensive experiments on standard benchmark datasets to evaluate the effectiveness of our approach,demonstrating its ability to generate informative and semantically rich captions for diverse images.The experimental results showcase the synergistic benefits of integrating YOLOv8,EfficientNetB7,and Transformers in advancing the state-of-the-art in image captioning tasks.The proposed multimodal approach has yielded impressive outcomes,generating informative and semantically rich captions for a diverse range of images.By combining the strengths of YOLOv8,EfficientNetB7,and Transformers,the model has achieved state-of-the-art results in image captioning tasks.The significance of this approach lies in its ability to address the challenging task of generating coherent and contextually relevant captions while achieving a comprehensive understanding of image content.The integration of three powerful deep learning 
architectures demonstrates the synergistic benefits of multimodal fusion in advancing the state-of-the-art in image captioning.Furthermore,this approach has a profound impact on the field,opening up new avenues for research in multimodal deep learning and paving the way for more sophisticated and context-aware image captioning systems.These systems have the potential to make significant contributions to various fields,encompassing human-computer interaction,computer vision and natural language processing. 展开更多
Keywords: image caption, multimodal methods, YOLOv8, EfficientNetB7, feature extraction, Transformers, encoder, decoder, Flickr8k
A deep dense captioning framework with joint localization and contextual reasoning
18
Authors: KONG Rui, XIE Wei. Journal of Central South University (SCIE, EI, CAS, CSCD), 2021, No. 9, pp. 2801-2813 (13 pages)
Dense captioning aims to simultaneously localize and describe regions-of-interest (RoIs) in images in natural language. Specifically, we identify three key problems: 1) dense and highly overlapping RoIs, which make accurate localization of each target region challenging; 2) visually ambiguous target regions that are hard to recognize by appearance alone; 3) the need for an extremely deep image representation, which is of central importance for visual recognition. To tackle these three challenges, we propose a novel end-to-end dense captioning framework consisting of a joint localization module, a contextual reasoning module, and a deep convolutional neural network (CNN). We also evaluate five deep CNN structures to explore the benefits of each. Extensive experiments on the Visual Genome (VG) dataset demonstrate the effectiveness of our approach, which compares favorably with state-of-the-art methods.
Keywords: dense captioning, joint localization, contextual reasoning, deep convolutional neural network
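The "dense and highly overlapping RoIs" problem named in the abstract above is commonly quantified with intersection-over-union (IoU). The following is a minimal sketch of that standard measure, not the paper's joint localization module; the function name `iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # Degenerate (zero-area) inputs yield 0.0 instead of dividing by zero.
    return inter / union if union > 0 else 0.0
```

Two regions with IoU close to 1 cover nearly the same pixels, which is the situation where assigning a distinct caption to each region becomes hard.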
A Video Captioning Method by Semantic Topic-Guided Generation
19
Authors: Ou Ye, Xinli Wei, Zhenhua Yu, Yan Fu, Ying Yang. Computers, Materials & Continua (SCIE, EI), 2024, No. 1, pp. 1071-1093 (23 pages)
In video captioning methods based on an encoder-decoder, limited visual features are extracted by an encoder, and a natural sentence describing the video content is generated by a decoder. However, this kind of method depends on a single video input source and few visual labels, and it suffers from poor semantic alignment between video contents and generated sentences, which makes it unsuitable for accurately comprehending and describing the video contents. To address this issue, this paper proposes a video captioning method by semantic topic-guided generation. First, a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during encoding. Then, the semantic topics of the video data are extracted using the visual labels retrieved from similar video data. In decoding, a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network, which decreases the influence of “deviation” in the semantic mapping process between videos and texts by jointly decoding a baseline and the semantic topics of the video contents. During this process, the designed Enhance-TopK sampling algorithm can alleviate a long-tail problem by dynamically adjusting the probability distribution of the predicted words. Finally, experiments are conducted on two public datasets, Microsoft Research Video Description and Microsoft Research-Video to Text. The experimental results demonstrate that the proposed method outperforms several state-of-the-art approaches. Specifically, the performance indicators Bilingual Evaluation Understudy, Metric for Evaluation of Translation with Explicit Ordering, Recall Oriented Understudy for Gisting Evaluation-longest common subsequence, and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%, 0.1%, 0.3%, and 2.4% on the Microsoft Research Video Description dataset, and by 0.1%, 1.0%, 0.1%, and 2.8% on the Microsoft Research-Video to Text dataset, respectively, compared with existing video captioning methods. As a result, the proposed method can generate video captions that are more closely aligned with human natural language expression habits.
Keywords: video captioning, encoder-decoder, semantic topic, jointly decoding, Enhance-TopK sampling
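For context on the decoding step described in the abstract above, here is a minimal sketch of plain top-k sampling, the standard technique that the Enhance-TopK name suggests it builds on. This is not the paper's algorithm: the abstract does not give the rule by which Enhance-TopK adjusts the word distribution, so none is reproduced here. The function name `top_k_sample` and its `temperature` and `rng` parameters are assumptions made for illustration.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one index from the k highest-scoring entries of `logits`."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = random.Random()
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (max subtracted for numerical stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding. Larger `k` admits rarer words, which is the regime where a long-tail problem such as the one the abstract mentions can arise.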