Journal Articles
44 articles found
1. MFSR: Maximum Feature Score Region-based Captions Locating in News Video Images
Authors: Zhi-Heng Wang, Chao Guo, Hong-Min Liu, Zhan-Qiang Huo. International Journal of Automation and Computing (EI, CSCD), 2018, No. 4, pp. 454-461 (8 pages).
For news video images, caption recognition is a useful and important step for content understanding. Caption locating is usually the first step of caption recognition, and this paper proposes a simple but effective caption locating algorithm, the maximum feature score region (MFSR) based method, which mainly consists of two stages: in the first stage, the up/down boundaries are obtained from an edge-map projection; then, the maximum feature score region is defined and the left/right boundaries are obtained using MFSR. Experiments show that the proposed MFSR-based method delivers superior and robust performance on news video images of different types.
Keywords: News video images; captions recognizing; captions locating; content understanding; maximum feature score region (MFSR)
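The two-stage locating idea in this entry (row projection of an edge map for the top/bottom band, then a highest-scoring column region for left/right) can be illustrated with a minimal NumPy sketch. The edge operator, thresholds, and run-length scoring below are illustrative assumptions, not the paper's actual MFSR definition.

```python
import numpy as np

def locate_caption(gray, row_thresh=0.35, col_thresh=0.25):
    """Rough caption-box locator in the spirit of edge-projection methods.

    gray: 2-D grayscale frame. Thresholds are illustrative, not the paper's.
    Returns (top, bottom, left, right) or None if no caption-like band is found.
    """
    g = gray.astype(np.float32)
    # Simple horizontal/vertical gradients as an edge map (the exact edge
    # operator is not specified in the abstract).
    gx = np.abs(np.diff(g, axis=1, prepend=g[:, :1]))
    gy = np.abs(np.diff(g, axis=0, prepend=g[:1, :]))
    edges = gx + gy

    # Stage 1: row projection -> up/down boundaries of the text band.
    row_proj = edges.sum(axis=1)
    row_proj /= row_proj.max() + 1e-6
    rows = np.where(row_proj > row_thresh)[0]
    if rows.size == 0:
        return None
    top, bottom = rows.min(), rows.max()

    # Stage 2: within the band, keep the widest contiguous high-score column
    # run as a stand-in for the "maximum feature score region" that gives
    # the left/right boundaries.
    band = edges[top:bottom + 1]
    col_proj = band.sum(axis=0)
    col_proj /= col_proj.max() + 1e-6
    mask = col_proj > col_thresh
    best, cur_start, best_span = None, None, 0
    for i, m in enumerate(np.append(mask, False)):
        if m and cur_start is None:
            cur_start = i
        elif not m and cur_start is not None:
            if i - cur_start > best_span:
                best_span, best = i - cur_start, (cur_start, i - 1)
            cur_start = None
    if best is None:
        return None
    left, right = best
    return top, bottom, left, right

frame = (np.random.rand(360, 640) * 255).astype(np.uint8)  # toy frame
print(locate_caption(frame))
```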
2. The Effect of TV Captions on the Comprehension of Non-Native Saudi Learners of English
Author: Mubarak Alkhatnai. Sino-US English Teaching, 2012, No. 10, pp. 1573-1579 (7 pages).
This paper investigates the effectiveness of closed captioning in aiding Saudi students who are learning ESL (English as a second language). Research was carried out in a qualitative manner, and participants were 12 Saudi students pursuing their studies at Indiana University of Pennsylvania, USA (IUP). Participants in the study were asked to compose a narrative after viewing a 5-minute film segment, both with and without captioning. Their responses were then analyzed, and results indicated that while captions may aid one in comprehension, they also tend to limit one's interpretations, reaffirming the nature of written language as an authoritative source of information.
Keywords: TV (television) captions; comprehension; ESL (English as a second language); written text; language classroom
3. THE EFFECTS OF CAPTIONS ON CHINESE EFL STUDENTS' INCIDENTAL VOCABULARY ACQUISITION (cited: 2)
Author: 汪徽. Chinese Journal of Applied Linguistics, 2007, No. 4, pp. 9-16, 128 (9 pages).
This research investigated the effects of captions on Chinese EFL students' incidental vocabulary acquisition. The results are: 1) Captions contribute substantially to students' incidental vocabulary acquisition. In particular, English captions more greatly enhanced students' mastery of word spelling and listening word recognition, while Chinese captions better improved students' mastery of word meaning. 2) The effects of both L1 and L2 captions on vocabulary acquisition are conspicuous for students of both high and low L2 proficiency. However, in terms of word meaning, students with low L2 proficiency benefit far more from L1 captions than from L2 subtitles, while the differences between the effects of L1 and L2 captions are not so great for students of high L2 proficiency. The study suggests that captioned authentic video materials can be a good way to teach L2 vocabulary, but teachers have to be careful in deciding when and how to use the captions.
Keywords: captions; incidental vocabulary acquisition; L1; L2
4. Deconfounded fashion image captioning with transformer and multimodal retrieval
Authors: Tao PENG, Weiqiao YIN, Junping LIU, Li LI, Xinrong HU. Virtual Reality & Intelligent Hardware (虚拟现实与智能硬件), 2025, No. 2, pp. 127-138 (12 pages).
Background: The annotation of fashion images is a significantly important task in the fashion industry as well as social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task entails multiple challenges, including the lack of fine-grained captions and confounders caused by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capabilities. Method: In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing, and then constructs a detailed causal graph using causal inference in the decoder to perform deconfounding. Multimodal retrieval is used to obtain semantic words related to image features, which are input into the decoder as prompt words to enrich sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while concurrently eliminating visual and language confounding. Results: Overall, our method can not only effectively enrich the captions of target images, but also greatly reduce confounders caused by the dataset. To verify the effectiveness of the proposed framework, the model was experimentally verified using the FACAD dataset.
Keywords: Image caption; causal inference; fashion caption
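The retrieval step described above (fetching semantic words related to the image features and feeding them to the decoder as prompt words) can be pictured as a plain cosine-similarity lookup. The shared embedding space, the toy vocabulary, and the retrieve_prompt_words helper below are illustrative assumptions; the paper's actual retrieval index and prompt format are not given in the abstract.

```python
import numpy as np

def retrieve_prompt_words(image_emb, word_embs, vocab, k=5):
    """Pick the k vocabulary words whose embeddings are most similar to the
    image embedding and return them as prompt words for the caption decoder.
    """
    img = image_emb / (np.linalg.norm(image_emb) + 1e-8)
    w = word_embs / (np.linalg.norm(word_embs, axis=1, keepdims=True) + 1e-8)
    sims = w @ img                      # cosine similarity per word
    top = np.argsort(-sims)[:k]
    return [vocab[i] for i in top]

# Toy usage with random vectors standing in for a shared image-text space.
rng = np.random.default_rng(0)
vocab = ["sleeve", "denim", "collar", "floral", "zipper", "pleated"]
words = rng.normal(size=(len(vocab), 64))
image = rng.normal(size=64)
print(retrieve_prompt_words(image, words, vocab, k=3))
```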
5. UniTrans: Unified Parameter-Efficient Transfer Learning and Multimodal Alignment for Large Multimodal Foundation Model
Authors: Jiakang Sun, Ke Chen, Xinyang He, Xu Liu, Ke Li, Cheng Peng. Computers, Materials & Continua, 2025, No. 4, pp. 219-238 (20 pages).
With the advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal Consistency Alignment Module and the Query-Augmentation Side Network, specifically optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks demonstrate that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters. Additionally, it achieves superior performance compared to fully fine-tuned models on certain benchmarks.
Keywords: Parameter-efficient transfer learning; multimodal alignment; image captioning; image-text retrieval; visual question answering
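A rough sketch of what a vector-based random-matrix adapter can look like, assuming a VeRA-style parameterization (frozen random low-rank matrices scaled by two small trainable vectors). The class name, rank, and initialization below are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class VectorRandomMatrixLinear(nn.Module):
    """VeRA-style adapted linear layer: the base weight and a pair of random
    low-rank matrices stay frozen; only two small scaling vectors are trained.
    One plausible reading of "Vector-based Cross-modal Random Matrix
    Adaptation", not the paper's exact formulation.
    """
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        in_f, out_f = base.in_features, base.out_features
        # Frozen shared random projections.
        self.register_buffer("A", torch.randn(rank, in_f) / in_f ** 0.5)
        self.register_buffer("B", torch.randn(out_f, rank) / rank ** 0.5)
        # The only trainable parameters: per-rank and per-output scales.
        self.d = nn.Parameter(torch.ones(rank))
        self.b = nn.Parameter(torch.zeros(out_f))   # zero init -> zero delta at start

    def forward(self, x):
        delta = self.b * ((self.d * (x @ self.A.t())) @ self.B.t())
        return self.base(x) + delta

layer = VectorRandomMatrixLinear(nn.Linear(768, 768), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8 + 768 = 776
```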
6. A Survey on Enhancing Image Captioning with Advanced Strategies and Techniques
Authors: Alaa Thobhani, Beiji Zou, Xiaoyan Kui, Amr Abdussalam, Muhammad Asim, Sajid Shah, Mohammed ELAffendi. Computer Modeling in Engineering & Sciences, 2025, No. 3, pp. 2247-2280 (34 pages).
Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful semantic sentences that describe visual content depicted in photographs and are syntactically accurate. Many real-world applications rely on image captioning, such as helping people with visual impairments to see their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are utilized to comprehend the visual content within an image, followed by natural language processing methods. Numerous approaches and models have been developed to deal with this multifaceted problem. Several models prove to be state-of-the-art solutions in this field. This work offers an exclusive perspective emphasizing the most critical strategies and techniques for enhancing image caption generation. Rather than reviewing all previous image captioning work, we analyze various techniques that significantly improve image caption generation and achieve significant performance improvements, including encompassing image captioning with visual attention methods, exploring semantic information types in captions, and employing multi-caption generation techniques. Further, advancements such as neural architecture search, few-shot learning, multi-phase learning, and cross-modal embedding within image caption networks are examined for their transformative effects. The comprehensive quantitative analysis conducted in this study identifies cutting-edge methodologies and sheds light on their profound impact, driving forward the forefront of image captioning technology.
Keywords: Image captioning; semantic attention; multi-caption; natural language processing; visual attention methods
7. LREGT: Local Relationship Enhanced Gated Transformer for Image Captioning
Authors: Yuting He, Zetao Jiang. Computers, Materials & Continua, 2025, No. 9, pp. 5487-5508 (22 pages).
Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies, which effectively extracts and leverages the global correlation of image features. However, these models still face challenges in effectively capturing local associations. Moreover, since the encoder extracts global and local association features that focus on different semantic information, semantic noise may occur during the decoding stage. To address these issues, we propose the Local Relationship Enhanced Gated Transformer (LREGT). In the encoder part, we introduce the Local Relationship Enhanced Encoder (LREE), whose core component is the Local Relationship Enhanced Module (LREM). LREM consists of two novel designs: the Local Correlation Perception Module (LCPM) and the Local-Global Fusion Module (LGFM), which are beneficial for generating a comprehensive feature representation that integrates both global and local information. In the decoder part, we propose the Dual-level Multi-branch Gated Decoder (DMGD). It first creates multiple decoding branches to generate multi-perspective contextual feature representations. Subsequently, it employs the Dual-Level Gating Mechanism (DLGM) to model the multi-level relationships of these multi-perspective contextual features, enhancing their fine-grained semantics and intrinsic relationship representations. This ultimately leads to the generation of high-quality and semantically rich image captions. Experiments on the standard MSCOCO dataset demonstrate that LREGT achieves state-of-the-art performance, with a CIDEr score of 140.8 and a BLEU-4 score of 41.3, significantly outperforming existing mainstream methods. These results highlight LREGT's superiority in capturing complex visual relationships and resolving semantic noise during decoding.
Keywords: Image captioning; local relation enhancement; local correlation perception; dual-level gating mechanism
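The local-global fusion and gating ideas named above can be illustrated with a small PyTorch module that blends windowed (convolutional) local context with global self-attention through a learned gate. The depthwise-convolution stand-in for local correlation perception, the gate design, and all dimensions are assumptions for illustration only, not the paper's LCPM/LGFM/DLGM internals.

```python
import torch
import torch.nn as nn

class LocalGlobalGatedFusion(nn.Module):
    """Gated blend of local and global region features over a token sequence."""
    def __init__(self, dim: int, heads: int = 8, kernel: int = 3):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Depthwise 1-D conv over the token sequence as a cheap stand-in for
        # local correlation perception between neighbouring regions.
        self.local_mix = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        g, _ = self.global_attn(x, x, x)       # global long-range context
        l = self.local_mix(x.transpose(1, 2)).transpose(1, 2)  # local context
        gate = torch.sigmoid(self.gate(torch.cat([g, l], dim=-1)))
        return gate * g + (1.0 - gate) * l     # gated local-global blend

feats = torch.randn(2, 49, 512)                # e.g., a 7x7 region grid
print(LocalGlobalGatedFusion(512)(feats).shape)
```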
8. Image Captioning Using Multimodal Deep Learning Approach
Authors: Rihem Farkh, Ghislain Oudinet, Yasser Foued. Computers, Materials & Continua (SCIE, EI), 2024, No. 12, pp. 3951-3968 (18 pages).
The process of generating descriptive captions for images has witnessed significant advancements in recent years, owing to progress in deep learning techniques. Despite these advancements, the task of thoroughly grasping image content and producing coherent, contextually relevant captions continues to pose a substantial challenge. In this paper, we introduce a novel multimodal method for image captioning by integrating three powerful deep learning architectures: YOLOv8 (You Only Look Once) for robust object detection, EfficientNetB7 for efficient feature extraction, and Transformers for effective sequence modeling. Our proposed model combines the strengths of YOLOv8 in detecting objects, the superior feature representation capabilities of EfficientNetB7, and the contextual understanding and sequential generation abilities of Transformers. We conduct extensive experiments on standard benchmark datasets to evaluate the effectiveness of our approach, demonstrating its ability to generate informative and semantically rich captions for diverse images. The experimental results showcase the synergistic benefits of integrating YOLOv8, EfficientNetB7, and Transformers, with which the model achieves state-of-the-art results in image captioning tasks. The significance of this approach lies in its ability to address the challenging task of generating coherent and contextually relevant captions while achieving a comprehensive understanding of image content. It also opens up new avenues for research in multimodal deep learning and paves the way for more sophisticated and context-aware image captioning systems, with potential contributions to human-computer interaction, computer vision, and natural language processing.
Keywords: Image caption; multimodal methods; YOLOv8; EfficientNetB7; feature extraction; Transformers; encoder; decoder; Flickr8k
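A minimal sketch of the three-part pipeline described above: grid features from a torchvision EfficientNet-B7, externally supplied object features (in practice these would be pooled from YOLOv8 detections, e.g. via the ultralytics package), and a Transformer decoder over the fused visual memory. Dimensions, vocabulary size, and the fusion-by-concatenation choice are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b7

class FusionCaptioner(nn.Module):
    """Detector features + CNN grid features -> Transformer caption decoder."""
    def __init__(self, vocab_size=10000, d_model=512, obj_dim=256):
        super().__init__()
        backbone = efficientnet_b7(weights=None)        # random init for the sketch
        self.cnn = backbone.features                    # (B, 2560, h, w) feature map
        self.grid_proj = nn.Linear(2560, d_model)
        self.obj_proj = nn.Linear(obj_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, obj_feats, tokens):
        fmap = self.cnn(images)                                  # (B, 2560, h, w)
        grid = fmap.flatten(2).transpose(1, 2)                   # (B, h*w, 2560)
        memory = torch.cat([self.grid_proj(grid),
                            self.obj_proj(obj_feats)], dim=1)    # fused visual memory
        tgt = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

model = FusionCaptioner()
logits = model(torch.randn(1, 3, 224, 224),        # image
               torch.randn(1, 5, 256),             # 5 detected-object features
               torch.randint(0, 10000, (1, 12)))   # partial caption tokens
print(logits.shape)                                # (1, 12, 10000)
```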
9. A Video Captioning Method by Semantic Topic-Guided Generation
Authors: Ou Ye, Xinli Wei, Zhenhua Yu, Yan Fu, Ying Yang. Computers, Materials & Continua (SCIE, EI), 2024, No. 1, pp. 1071-1093 (23 pages).
In video captioning methods based on an encoder-decoder, limited visual features are extracted by an encoder, and a natural sentence describing the video content is generated using a decoder. However, this kind of method depends on a single video input source and few visual labels, and there is a problem of semantic alignment between video contents and generated natural sentences, which makes it unsuitable for accurately comprehending and describing video contents. To address this issue, this paper proposes a video captioning method by semantic topic-guided generation. First, a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during encoding. Then, the semantic topics of video data are extracted using the visual labels retrieved from similar video data. In the decoding, a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network, which decreases the influence of "deviation" in the semantic mapping process between videos and texts by jointly decoding a baseline and semantic topics of video contents. During this process, the designed Enhance-TopK sampling algorithm can alleviate the long-tail problem by dynamically adjusting the probability distribution of the predicted words. Finally, experiments are conducted on two publicly used datasets, Microsoft Research Video Description and Microsoft Research-Video to Text. The experimental results demonstrate that the proposed method outperforms several state-of-the-art approaches. Specifically, the performance indicators Bilingual Evaluation Understudy, Metric for Evaluation of Translation with Explicit Ordering, Recall-Oriented Understudy for Gisting Evaluation-longest common subsequence, and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%, 0.1%, 0.3%, and 2.4% on the Microsoft Research Video Description dataset, and 0.1%, 1.0%, 0.1%, and 2.8% on the Microsoft Research-Video to Text dataset, respectively, compared with existing video captioning methods. As a result, the proposed method can generate video captions that are more closely aligned with human natural language expression habits.
Keywords: Video captioning; encoder-decoder; semantic topic; jointly decoding; Enhance-TopK sampling
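The abstract states that Enhance-TopK dynamically adjusts the probability distribution over predicted words to ease the long-tail problem, without giving the rule. The sketch below shows ordinary top-k sampling with a simple flattening adjustment (exponent alpha) as a stand-in for that re-weighting; k, alpha, and the toy vocabulary are illustrative.

```python
import numpy as np

def topk_sample(logits, k=10, alpha=0.7, rng=np.random.default_rng()):
    """Top-k sampling with a mild flattening of the kept probabilities
    (raising them to the power alpha < 1), a stand-in for the dynamic
    re-weighting used by Enhance-TopK to ease the long-tail problem.
    """
    top = np.argpartition(logits, -k)[-k:]            # indices of the k best logits
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()
    p = p ** alpha                                     # flatten: boost rarer words
    p /= p.sum()
    return int(rng.choice(top, p=p))

vocab = ["a", "man", "rides", "horse", "the", "dog", "runs", "on", "beach", "cat"]
logits = np.array([2.5, 1.8, 0.3, 0.1, 2.2, -0.5, 0.0, 1.0, -1.2, -0.8])
print(vocab[topk_sample(logits, k=5)])
```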
10. Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review
Authors: Ekanayake Mudiyanselage Chulabhaya Lankanatha Ekanayake, Abubakar Sulaiman Gezawa, Yunqi Lei. Computers, Materials & Continua (SCIE, EI), 2024, No. 3, pp. 2941-2965 (25 pages).
Video description generates natural language sentences that describe the subject, verb, and objects of the targeted video. Video description has been used to help visually impaired people understand the content, and it also plays an essential role in developing human-robot interaction. Dense video description is more difficult than simple video captioning because of object interactions and event overlapping. Deep learning is changing the shape of computer vision (CV) technologies and natural language processing (NLP). There are hundreds of deep learning models, datasets, and evaluations that can improve the gaps in current research. This article fills this gap by evaluating some state-of-the-art approaches, especially focusing on deep learning and machine learning for video captioning in a dense environment. In this article, some classic techniques concerning existing machine learning are reviewed, and deep learning models are presented along with details of benchmark datasets and their respective domains. This paper reviews various evaluation metrics, including Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Word Mover's Distance (WMD), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE), with their pros and cons. Finally, this article lists some future directions and proposed work for context enhancement using key scene extraction with object detection in a particular frame, especially how to improve the context of video description by analyzing keyframe detection through morphological image analysis. Additionally, the paper discusses a novel approach involving sentence reconstruction and context improvement through keyframe object detection, which incorporates the fusion of large language models for refining results. The ultimate results arise from enhancing the generated text of the proposed model by improving the predicted text and isolating objects using various keyframes. These keyframes identify dense events occurring in the video sequence.
Keywords: Video description; video to text; video caption; sentence reconstruction
11. A Concise and Varied Visual Features-Based Image Captioning Model with Visual Selection
Authors: Alaa Thobhani, Beiji Zou, Xiaoyan Kui, Amr Abdussalam, Muhammad Asim, Naveed Ahmed, Mohammed Ali Alshara. Computers, Materials & Continua (SCIE, EI), 2024, No. 11, pp. 2873-2894 (22 pages).
Image captioning has gained increasing attention in recent years. Visual characteristics found in input images play a crucial role in generating high-quality captions. Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image, improving the effectiveness of identifying relevant image regions at each step of caption generation. However, providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features. Consequently, this leads to enhanced captioning network performance. In light of this, we present an image captioning framework that efficiently exploits the extracted representations of the image. Our framework comprises three key components: the Visual Feature Detector module (VFD), the Visual Feature Visual Attention module (VFVA), and the language model. The VFD module is responsible for detecting a subset of the most pertinent features from the local visual features, creating an updated visual features matrix. Subsequently, the VFVA directs its attention to the visual features matrix generated by the VFD, resulting in an updated context vector employed by the language model to generate an informative description. Integrating the VFD and VFVA modules introduces an additional layer of processing for the visual features, thereby contributing to enhancing the image captioning model's performance. Using the MS-COCO dataset, our experiments show that the proposed framework competes well with state-of-the-art methods, effectively leveraging visual representations to improve performance. The implementation code can be found here: https://github.com/althobhani/VFDICM (accessed on 30 July 2024).
Keywords: Visual attention; image captioning; visual feature detector; visual feature visual attention
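A compact sketch of the select-then-attend idea behind VFD and VFVA: score each local feature, keep the top-k, and attend over the reduced set with a query from the language model. The scoring head, the value of k, and the attention layer are illustrative assumptions rather than the paper's exact modules.

```python
import torch
import torch.nn as nn

class VisualFeatureSelector(nn.Module):
    """Keep the k highest-scoring region features, then attend over them."""
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)             # relevance score per region
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats, query):                # feats: (B, N, D), query: (B, 1, D)
        scores = self.scorer(feats).squeeze(-1)     # (B, N)
        idx = scores.topk(self.k, dim=1).indices    # k most relevant regions
        sel = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        ctx, _ = self.attn(query, sel, sel)         # context vector for the LM
        return ctx.squeeze(1)

feats = torch.randn(2, 49, 512)                     # 7x7 grid of region features
query = torch.randn(2, 1, 512)                      # current decoder state
print(VisualFeatureSelector(512)(feats, query).shape)   # (2, 512)
```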
12. IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models
Authors: Zhihao Chen, Bin Hu, Chuang Niu, Tao Chen, Yuxin Li, Hongming Shan, Ge Wang. Visual Computing for Industry, Biomedicine, and Art, 2024, No. 1, pp. 165-181 (17 pages).
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlation from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and potential supplement or even replacement of radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
Keywords: Deep learning; medical imaging; image captioning; multimodality; large language model; vision-language model; GPT-4; subjective evaluation
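The prompt-template step (turning an annotated quality score into a semantically rich description before fine-tuning the captioning VLM) can be sketched in a few lines; the five-level wording below is hypothetical, not the paper's annotation scheme.

```python
def score_to_description(score: int) -> str:
    """Turn an annotated CT image-quality score into a short textual
    description, in the spirit of the prompt-template step described above.
    The level wording and template are illustrative only.
    """
    levels = {
        1: "severe noise and artifacts; diagnostic value is poor",
        2: "noticeable noise; fine structures are partly obscured",
        3: "moderate quality; most anatomical details are visible",
        4: "good quality with mild noise; details are well preserved",
        5: "excellent quality; sharp detail and minimal noise",
    }
    return (f"This CT slice is rated {score} out of 5: "
            f"{levels.get(score, 'quality level is undefined')}.")

print(score_to_description(4))
```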
13. Visuals to Text: A Comprehensive Review on Automatic Image Captioning (cited: 5)
Authors: Yue Ming, Nannan Hu, Chunxiao Fan, Fan Feng, Jiangwan Zhou, Hui Yu. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2022, No. 8, pp. 1339-1365 (27 pages).
Image captioning refers to the automatic generation of descriptive texts according to the visual content of images. It is a technique integrating multiple disciplines, including computer vision (CV), natural language processing (NLP) and artificial intelligence. In recent years, substantial research efforts have been devoted to generating image captions, with impressive progress. To summarize the recent advances in image captioning, we present a comprehensive review covering both traditional methods and recent deep learning-based techniques. Specifically, we first briefly review the early traditional works based on retrieval and templates. Then deep learning-based image captioning research is examined, categorized into the encoder-decoder framework, attention mechanisms and training strategies on the basis of model structures and training manners for a detailed introduction. After that, we summarize the publicly available datasets, evaluation metrics and those proposed for specific requirements, and then compare the state-of-the-art methods on the MS COCO dataset. Finally, we provide some discussions on open challenges and future research directions.
Keywords: Artificial intelligence; attention mechanism; encoder-decoder framework; image captioning; multi-modal understanding; training strategies
14. VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning (cited: 3)
Authors: WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng, LU Lina. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2023, No. 1, pp. 9-18 (10 pages).
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic with the challenges of overfitting and the difficulty of image-text alignment. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English contents. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a pre-trained Chinese language model. The experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm is more descriptive and informative than existing algorithms in producing captions.
Keywords: remote sensing image captioning (RSIC); vision-language representation; remote sensing image caption dataset; attention mechanism
15. A Position-Aware Transformer for Image Captioning (cited: 3)
Authors: Zelin Deng, Bo Zhou, Pei He, Jianfeng Huang, Osama Alfarraj, Amr Tolba. Computers, Materials & Continua (SCIE, EI), 2022, No. 1, pp. 2065-2081 (17 pages).
Image captioning aims to generate a corresponding description of an image. In recent years, neural encoder-decoder models have been the dominant approaches, in which the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features, ignoring the effects of other image features, and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms for the above problems. The image-feature attention first extracts multi-level features by using a Feature Pyramid Network (FPN), then utilizes the scaled dot product to fuse these features, which enables our model to detect objects of different scales in the image more effectively without increasing parameters. In the position-aware attention mechanism, the relative positions between image features are obtained first, and afterwards the relative positions are incorporated into the original image features to generate captions more accurately. Experiments are carried out on the MSCOCO dataset, and our approach achieves competitive BLEU-4, METEOR, ROUGE-L and CIDEr scores compared with some state-of-the-art approaches, demonstrating the effectiveness of our approach.
Keywords: Deep learning; image captioning; Transformer; attention; position-aware
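The position-aware attention described above (folding relative positions between image features into the attention computation) can be sketched as a single-head attention whose logits receive a learned bias from pairwise region-centre offsets; the bias MLP and the use of box centres are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RelativePositionAttention(nn.Module):
    """Single-head attention with a learned bias from relative region positions."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pos_bias = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
        self.scale = dim ** -0.5

    def forward(self, feats, centres):              # feats: (B,N,D), centres: (B,N,2)
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        logits = (q @ k.transpose(1, 2)) * self.scale          # (B, N, N)
        rel = centres.unsqueeze(2) - centres.unsqueeze(1)      # pairwise (dx, dy)
        logits = logits + self.pos_bias(rel).squeeze(-1)       # position-aware bias
        return torch.softmax(logits, dim=-1) @ v

feats = torch.randn(2, 36, 512)
centres = torch.rand(2, 36, 2)                     # normalised box centres
print(RelativePositionAttention(512)(feats, centres).shape)
```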
16. Global-Attention-Based Neural Networks for Vision Language Intelligence (cited: 3)
Authors: Pei Liu, Yingjie Zhou, Dezhong Peng, Dapeng Wu. IEEE/CAA Journal of Automatica Sinica (SCIE, EI, CSCD), 2021, No. 7, pp. 1243-1252 (10 pages).
In this paper, we develop a novel global-attention-based neural network (GANN) for vision-language intelligence, specifically image captioning (language description of a given image). As in many previous works, the encoder-decoder framework is adopted in our proposed model, in which the encoder is responsible for encoding the region proposal features and extracting a global caption feature based on a specially designed module for predicting the caption objects, and the decoder generates captions by taking the obtained global caption feature along with the encoded visual features as inputs for each attention head of the decoder layer. The global caption feature is introduced for the purpose of exploring the latent contributions of region proposals for image captioning, and further helping the decoder better focus on the most relevant proposals so as to extract a more accurate visual feature at each time step of caption generation. Our GANN is implemented by incorporating the global caption feature into the attention weight calculation phase of the word prediction process in each head of the decoder layer. In our experiments, we qualitatively analyze the proposed model, and quantitatively evaluate several state-of-the-art schemes together with GANN on the MS-COCO dataset. Experimental results demonstrate the effectiveness of the proposed global attention mechanism for image captioning.
Keywords: Global attention; image captioning; latent contribution
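One way to picture "incorporating the global caption feature into the attention weight calculation" is to concatenate the global feature with the decoder state before forming the attention query, as in the sketch below; this is a simplified reading for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GlobalGuidedAttention(nn.Module):
    """Attention head whose query mixes the decoder state with a global
    caption feature, so proposals that agree with the predicted caption
    objects receive larger attention weights.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(2 * dim, dim)   # decoder state + global caption feature
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, dec_state, global_feat, regions):
        # dec_state: (B, D), global_feat: (B, D), regions: (B, N, D)
        q = self.q(torch.cat([dec_state, global_feat], dim=-1)).unsqueeze(1)
        w = torch.softmax((q @ self.k(regions).transpose(1, 2)) * self.scale, dim=-1)
        return (w @ self.v(regions)).squeeze(1)      # attended visual feature

regions = torch.randn(2, 36, 512)
out = GlobalGuidedAttention(512)(torch.randn(2, 512), torch.randn(2, 512), regions)
print(out.shape)                                     # (2, 512)
```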
17. Natural Language Processing with Optimal Deep Learning-Enabled Intelligent Image Captioning System (cited: 1)
Authors: Radwa Marzouk, Eatedal Alabdulkreem, Mohamed K. Nour, Mesfer Al Duhayyim, Mahmoud Othman, Abu Sarwar Zamani, Ishfaq Yaseen, Abdelwahed Motwakel. Computers, Materials & Continua (SCIE, EI), 2023, No. 2, pp. 4435-4451 (17 pages).
The recent developments in Multimedia Internet of Things (MIoT) devices, empowered with Natural Language Processing (NLP) models, seem to be a promising future for smart devices. NLP plays an important role in industrial models such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, its actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for the image. In the next step, the image should be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning-Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for the input image. To attain this, the proposed NLPODL-IICS follows two stages, namely encoding and decoding. Initially, at the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with the Neural Search Architecture Network (NASNet) model, which represents the input data appropriately by inserting it into a predefined-length vector. Besides, during the decoding phase, the Chimp Optimization Algorithm (COA) with a deeper Long Short-Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps in accomplishing proper parameter tuning for the NASNet and LSTM models, respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A widespread comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models.
Keywords: Natural language processing; information retrieval; image captioning; deep learning; metaheuristics
18. Improved image captioning with subword units training and transformer (cited: 1)
Authors: Cai Qiang, Li Jing, Li Haisheng, Zuo Min. High Technology Letters (EI, CAS), 2020, No. 2, pp. 211-216 (6 pages).
Image captioning models typically operate with a fixed vocabulary, but captioning is an open-vocabulary problem. Existing work addresses the image captioning of out-of-vocabulary words by labeling them as unknown in a dictionary. In addition, the recurrent neural network (RNN) and its variants used in the captioning task have become a bottleneck in terms of generation quality and training time cost. To address these two essential problems, a simpler but more effective approach is proposed for generating open-vocabulary captions: the long short-term memory (LSTM) unit is replaced with a Transformer as the decoder for better caption quality and less training time. The effectiveness of different word segmentation vocabularies and the generation improvement of the Transformer over LSTM are discussed, and it is shown that the improved models achieve state-of-the-art performance on the MSCOCO2014 image captioning tasks over a back-off dictionary baseline model.
Keywords: image captioning; Transformer; byte pair encoding (BPE); reinforcement learning
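The subword-units idea behind this entry is byte pair encoding: repeatedly merging the most frequent symbol pair so that rare caption words decompose into known subwords instead of an unknown token. Below is a minimal, dependency-free BPE learner over a toy word-frequency dictionary; real training would run over the full caption corpus with far more merges.

```python
import re
from collections import Counter

def learn_bpe(words, num_merges=10):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    vocab = Counter({" ".join(w) + " </w>": c for w, c in words.items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq           # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair to merge
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pattern.sub("".join(best), w): c for w, c in vocab.items()})
    return merges

caption_words = {"snowboard": 3, "snowboarder": 2, "skateboard": 4, "boarding": 2}
print(learn_bpe(caption_words, num_merges=5))
```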
19. Efficient Image Captioning Based on Vision Transformer Models
Authors: Samar Elbedwehy, T. Medhat, Taher Hamza, Mohammed F. Alrahmawy. Computers, Materials & Continua (SCIE, EI), 2022, No. 10, pp. 1483-1500 (18 pages).
Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process, as it involves two sub-models: a vision sub-model for extracting object features and a language sub-model that uses the extracted features to generate meaningful captions. Attention-based vision transformer models have recently had a great impact in the vision field. In this paper, we studied the effect of using vision transformers on the image captioning process by evaluating the use of four different vision transformer models for the vision sub-model of the image captioning system. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCIT (Cross-Covariance Image Transformer), which changes the operation in self-attention by focusing on the feature dimension instead of the token dimension. The last one is SWIN (Shifted Windows), a vision transformer which, unlike the other transformers, uses shifted windows in splitting the image. For a deeper evaluation, the four mentioned vision transformers have been tested with their different versions and configurations: we evaluate the use of the DINO model with five different backbones, PVT with two versions (PVT_v1 and PVT_v2), one model of XCIT, and the SWIN transformer. The results show the high effectiveness of using the SWIN transformer within the proposed image captioning model compared with the other models.
Keywords: Image captioning; sequence-to-sequence; self-distillation; transformer; convolutional layer