Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful semantic sentences that describe visual content depicted in photographs and are syntactically accurate. Many real-world applications rely on image captioning, such as helping people with visual impairments perceive their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are utilized to comprehend the visual content within an image, followed by natural language processing methods. Numerous approaches and models have been developed to deal with this multifaceted problem, and several of them have proven to be state-of-the-art solutions in this field. This work offers a distinctive perspective emphasizing the most critical strategies and techniques for enhancing image caption generation. Rather than reviewing all previous image captioning work, we analyze the techniques that yield substantial performance improvements, including image captioning with visual attention methods, exploring semantic information types in captions, and employing multi-caption generation techniques. Further, advancements such as neural architecture search, few-shot learning, multi-phase learning, and cross-modal embedding within image caption networks are examined for their transformative effects. The comprehensive quantitative analysis conducted in this study identifies cutting-edge methodologies and sheds light on their profound impact, driving forward the forefront of image captioning technology.
Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies, which effectively extracts and leverages the global correlation of image features. However, these models still face challenges in effectively capturing local associations. Moreover, since the encoder extracts global and local association features that focus on different semantic information, semantic noise may occur during the decoding stage. To address these issues, we propose the Local Relationship Enhanced Gated Transformer (LREGT). In the encoder part, we introduce the Local Relationship Enhanced Encoder (LREE), whose core component is the Local Relationship Enhanced Module (LREM). LREM consists of two novel designs: the Local Correlation Perception Module (LCPM) and the Local-Global Fusion Module (LGFM), which are beneficial for generating a comprehensive feature representation that integrates both global and local information. In the decoder part, we propose the Dual-level Multi-branch Gated Decoder (DMGD). It first creates multiple decoding branches to generate multi-perspective contextual feature representations. Subsequently, it employs the Dual-Level Gating Mechanism (DLGM) to model the multi-level relationships of these multi-perspective contextual features, enhancing their fine-grained semantics and intrinsic relationship representations. This ultimately leads to the generation of high-quality and semantically rich image captions. Experiments on the standard MSCOCO dataset demonstrate that LREGT achieves state-of-the-art performance, with a CIDEr score of 140.8 and a BLEU-4 score of 41.3, significantly outperforming existing mainstream methods. These results highlight LREGT's superiority in capturing complex visual relationships and resolving semantic noise during decoding.
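As a concrete illustration of the local-global gating idea described above, the following minimal PyTorch sketch fuses a globally attended view of region features with a neighborhood-restricted (local) view through a learned sigmoid gate. The module name, window size, and tensor shapes are our own assumptions for illustration, not the authors' released LGFM/DLGM code.

```python
import torch
import torch.nn as nn

class GatedLocalGlobalFusion(nn.Module):
    """Toy stand-in for a local-global fusion block: mix a global self-attention
    branch with a neighborhood-restricted attention branch via a learned gate.
    Shapes and design choices are illustrative assumptions."""
    def __init__(self, dim: int, num_heads: int = 8, window: int = 5):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_regions, dim) grid/region features from the encoder.
        n = x.size(1)
        # Global branch: every region attends to every other region.
        g, _ = self.global_attn(x, x, x)
        # Local branch: block attention beyond a fixed neighborhood (True = masked).
        idx = torch.arange(n, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window
        l, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        # Learned gate decides, per region and channel, how to mix the two views.
        alpha = self.gate(torch.cat([g, l], dim=-1))
        return alpha * g + (1 - alpha) * l

feats = torch.randn(2, 49, 512)            # e.g., 7x7 grid features
fused = GatedLocalGlobalFusion(512)(feats)
print(fused.shape)                          # torch.Size([2, 49, 512])
```

The same gating pattern can be stacked at the decoder side to weigh multiple contextual branches, which is the spirit of the dual-level gating the abstract describes.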
Background: The annotation of fashion images is a significantly important task in the fashion industry as well as social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task entails multiple challenges, including the lack of fine-grained captions and confounders caused by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capabilities. Method: In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing, and then constructs a detailed causal graph using causal inference in the decoder to perform deconfounding. Multimodal retrieval is used to obtain semantic words related to image features, which are input into the decoder as prompt words to enrich sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while concurrently eliminating visual and language confounding. Results: Overall, our method can not only effectively enrich the captions of target images, but also greatly reduce confounders caused by the dataset. To verify the effectiveness of the proposed framework, the model was experimentally verified using the FACAD dataset.
Image captioning has gained increasing attention in recent years. Visual characteristics found in input images play a crucial role in generating high-quality captions. Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image, improving the effectiveness of identifying relevant image regions at each step of caption generation. However, providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features. Consequently, this leads to enhanced captioning network performance. In light of this, we present an image captioning framework that efficiently exploits the extracted representations of the image. Our framework comprises three key components: the Visual Feature Detector module (VFD), the Visual Feature Visual Attention module (VFVA), and the language model. The VFD module is responsible for detecting a subset of the most pertinent features from the local visual features, creating an updated visual features matrix. Subsequently, the VFVA directs its attention to the visual features matrix generated by the VFD, resulting in an updated context vector employed by the language model to generate an informative description. Integrating the VFD and VFVA modules introduces an additional layer of processing for the visual features, thereby contributing to enhancing the image captioning model's performance. Using the MS-COCO dataset, our experiments show that the proposed framework competes well with state-of-the-art methods, effectively leveraging visual representations to improve performance. The implementation code can be found here: https://github.com/althobhani/VFDICM (accessed on 30 July 2024).
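One minimal reading of the "detect the most pertinent features, then attend" pipeline is a top-k selection step followed by attention over the reduced feature matrix. The sketch below follows that reading; the scorer, the value of k, and the shapes are assumptions for illustration, not the released VFD/VFVA implementation.

```python
import torch
import torch.nn as nn

class TopKFeatureSelector(nn.Module):
    """Score each local visual feature, keep the top-k most relevant ones,
    then attend over the reduced set to produce a context vector."""
    def __init__(self, dim: int, k: int = 20, num_heads: int = 8):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)          # relevance score per region
        self.k = k
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, query: torch.Tensor):
        # feats: (B, N, D) local visual features; query: (B, 1, D) decoder state.
        scores = self.scorer(feats).squeeze(-1)               # (B, N)
        topk = scores.topk(self.k, dim=1).indices             # indices of kept regions
        idx = topk.unsqueeze(-1).expand(-1, -1, feats.size(-1))
        selected = feats.gather(1, idx)                        # (B, k, D) updated feature matrix
        # Attention over the selected subset yields the context vector.
        context, weights = self.attn(query, selected, selected)
        return context, weights

feats = torch.randn(2, 100, 512)
query = torch.randn(2, 1, 512)
context, _ = TopKFeatureSelector(512, k=20)(feats, query)
print(context.shape)   # torch.Size([2, 1, 512])
```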
The process of generating descriptive captions for images has witnessed significant advancements in recent years, owing to the progress in deep learning techniques. Despite these advancements, the task of thoroughly grasping image content and producing coherent, contextually relevant captions continues to pose a substantial challenge. In this paper, we introduce a novel multimodal method for image captioning by integrating three powerful deep learning architectures: YOLOv8 (You Only Look Once) for robust object detection, EfficientNetB7 for efficient feature extraction, and Transformers for effective sequence modeling. Our proposed model combines the strengths of YOLOv8 in detecting objects, the superior feature representation capabilities of EfficientNetB7, and the contextual understanding and sequential generation abilities of Transformers. We conduct extensive experiments on standard benchmark datasets to evaluate the effectiveness of our approach, demonstrating its ability to generate informative and semantically rich captions for diverse images. The experimental results showcase the synergistic benefits of integrating YOLOv8, EfficientNetB7, and Transformers, with the model achieving state-of-the-art results in image captioning tasks. The significance of this approach lies in its ability to address the challenging task of generating coherent and contextually relevant captions while achieving a comprehensive understanding of image content. Furthermore, this approach has a profound impact on the field, opening up new avenues for research in multimodal deep learning and paving the way for more sophisticated and context-aware image captioning systems. These systems have the potential to make significant contributions to various fields, encompassing human-computer interaction, computer vision, and natural language processing.
Image captioning refers to the automatic generation of descriptive texts according to the visual content of images. It is a technique integrating multiple disciplines, including computer vision (CV), natural language processing (NLP) and artificial intelligence. In recent years, substantial research efforts have been devoted to generating image captions, with impressive progress. To summarize the recent advances in image captioning, we present a comprehensive review covering both traditional methods and recent deep learning-based techniques. Specifically, we first briefly review the early traditional works based on retrieval and templates. We then focus on deep learning-based image captioning research, which we categorize into the encoder-decoder framework, attention mechanisms and training strategies, on the basis of model structures and training manners, for a detailed introduction. After that, we summarize the publicly available datasets, evaluation metrics and those proposed for specific requirements, and compare the state-of-the-art methods on the MS COCO dataset. Finally, we provide some discussions on open challenges and future research directions.
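The encoder-decoder framework that this review categorizes can be summarized in a few dozen lines. The sketch below pairs a CNN image encoder with an LSTM language decoder trained with teacher forcing; the backbone, layer sizes, and single-layer decoder are illustrative assumptions rather than any specific paper's setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    """Bare-bones encoder-decoder captioner: a CNN encodes the image into a
    feature vector that starts an LSTM language model."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)            # encoder (CV side)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)    # word embeddings (NLP side)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing).
        img_feat = self.encoder(images).unsqueeze(1)        # (B, 1, E) "visual token"
        words = self.embed(captions)                        # (B, T, E)
        seq = torch.cat([img_feat, words], dim=1)           # image feature starts the sequence
        hidden, _ = self.lstm(seq)
        # The output at step t has seen the image and words < t, so it predicts token t.
        return self.out(hidden[:, :-1, :])

model = SimpleCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)    # torch.Size([2, 12, 10000])
```

Attention mechanisms and the training strategies surveyed above (e.g., reinforcement-learning fine-tuning) are typically layered on top of this basic skeleton.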
Image captioning aims to generate a corresponding description of an image. In recent years, neural encoder-decoder models have been the dominant approaches, in which a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features, ignoring the effects of other image features, and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms to address the above problems. The image-feature attention first extracts multi-level features by using a Feature Pyramid Network (FPN), then utilizes scaled dot-product attention to fuse these features, which enables our model to detect objects of different scales in the image more effectively without increasing parameters. In the position-aware attention mechanism, the relative positions between image features are obtained first, and afterwards the relative positions are incorporated into the original image features to generate captions more accurately. Experiments are carried out on the MSCOCO dataset and our approach achieves competitive BLEU-4, METEOR, ROUGE-L and CIDEr scores compared with some state-of-the-art approaches, demonstrating the effectiveness of our approach.
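One plausible way to read "position-aware attention" is as an attention layer whose logits are biased by the pairwise geometric offsets between regions. The sketch below implements that reading with a small MLP over box-center offsets; it is an assumption-laden illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAwareAttention(nn.Module):
    """Single-head attention whose logits are biased by relative positions
    (box-center offsets) between image regions."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Maps a 2-D relative offset to a scalar attention bias.
        self.pos_mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) region features; centers: (B, N, 2) normalized box centers.
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        logits = torch.matmul(q, k.transpose(-1, -2)) * self.scale        # (B, N, N)
        # Relative offsets between every pair of regions, mapped to a bias term.
        rel = centers.unsqueeze(2) - centers.unsqueeze(1)                 # (B, N, N, 2)
        logits = logits + self.pos_mlp(rel).squeeze(-1)                   # inject geometry
        return torch.matmul(F.softmax(logits, dim=-1), v)

feats, centers = torch.randn(2, 36, 512), torch.rand(2, 36, 2)
print(PositionAwareAttention(512)(feats, centers).shape)   # torch.Size([2, 36, 512])
```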
In the field of satellite imagery, remote sensing image captioning (RSIC) is a hot topic, with the challenges of overfitting and the difficulty of image-text alignment. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object DetectIon in Optical Remote sensing images (DIOR) dataset with manually annotated Chinese and English contents. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. The experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm is more descriptive and informative than existing algorithms in producing captions.
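The cross-modal attention at the core of such vision-language aligning models can be sketched generically as word-level queries attending over visual features. The block below is a simplified stand-in for that idea, not the VLCA release; dimensions and the residual/normalization layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Word-level queries attend over visual features so the decoder can align
    language tokens with image regions."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text: (B, T, D) token features; vision: (B, N, D) region/grid features.
        aligned, _ = self.attn(query=text, key=vision, value=vision)
        return self.norm(text + aligned)   # residual keeps the language stream intact

text, vision = torch.randn(2, 16, 512), torch.randn(2, 49, 512)
print(CrossModalAttention(512)(text, vision).shape)   # torch.Size([2, 16, 512])
```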
The recent developments in Multimedia Internet of Things (MIoT) devices, empowered with Natural Language Processing (NLP) models, seem to be a promising future for smart devices. NLP plays an important role in industrial models such as speech understanding, emotion detection, home automation, and so on. If an image needs to be captioned, then the objects in that image, their actions and connections, and any salient feature that remains under-projected or missing from the image should be identified. The aim of the image captioning process is to generate a caption for the image. In the next step, the image should be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct. In this scenario, a computer vision model is used to identify the objects and NLP approaches are followed to describe the image. The current study develops a Natural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System (NLPODL-IICS). The aim of the presented NLPODL-IICS model is to produce a proper description for an input image. To attain this, the proposed NLPODL-IICS follows two stages, namely encoding and decoding. Initially, at the encoding side, the proposed NLPODL-IICS model makes use of Hunger Games Search (HGS) with the Neural Architecture Search Network (NASNet) model. This model represents the input data appropriately by inserting it into a predefined-length vector. Besides, during the decoding phase, the Chimp Optimization Algorithm (COA) with a deeper Long Short-Term Memory (LSTM) approach is followed to concatenate the description sentences produced by the method. The application of the HGS and COA algorithms helps in accomplishing proper parameter tuning for the NASNet and LSTM models, respectively. The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets. A widespread comparative analysis confirmed the superior performance of the NLPODL-IICS model over other models.
Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process, as it involves two sub-models: a vision sub-model for extracting object features and a language sub-model that uses the extracted features to generate meaningful captions. Attention-based vision transformer models have recently had a great impact on the vision field. In this paper, we studied the effect of using vision transformers on the image captioning process by evaluating four different vision transformer models as the vision sub-model of the image captioning system. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCIT (Cross-Covariance Image Transformer), which changes the operation in self-attention by focusing on the feature dimension instead of the token dimension. The last one is SWIN (Shifted Windows), a vision transformer which, unlike the other transformers, uses shifted windows in splitting the image. For a deeper evaluation, the four mentioned vision transformers have been tested with their different versions and configurations: we evaluate the use of the DINO model with five different backbones, PVT with two versions (PVT_v1 and PVT_v2), one model of XCIT, and the SWIN transformer. The results show the high effectiveness of using the SWIN transformer within the proposed image captioning model compared with the other models.
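A backbone comparison of this kind can be wired up by keeping the caption decoder fixed and swapping only the vision sub-model, for example through the `timm` library. The model names below are assumptions that depend on your installed timm version; this is an illustrative harness, not the paper's experimental code.

```python
# Swap the vision sub-model while keeping the captioning head fixed.
# Verify that these model names exist in your timm version before running.
import timm
import torch

CANDIDATE_BACKBONES = [
    "swin_base_patch4_window7_224",   # SWIN (shifted windows)
    "pvt_v2_b2",                      # PVT v2
    "xcit_small_12_p16_224",          # XCiT
    "vit_small_patch16_224.dino",     # ViT weights trained with DINO
]

def extract_features(name: str, images: torch.Tensor) -> torch.Tensor:
    # num_classes=0 drops the classification head, so the model returns pooled
    # features that a caption decoder can consume.
    model = timm.create_model(name, pretrained=False, num_classes=0)
    model.eval()
    with torch.no_grad():
        return model(images)

images = torch.randn(1, 3, 224, 224)
for name in CANDIDATE_BACKBONES:
    feats = extract_features(name, images)
    print(name, tuple(feats.shape))   # feature dimension differs per backbone
```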
The rapid development of multimedia-on-demand traffic in different forms of audio, video, and images has shifted the vision of the Internet of Things (IoT) from scalar data to the Internet of Multimedia Things (IoMT). Since Unmanned Aerial Vehicles (UAVs) generate a massive quantity of multimedia data, they become a part of IoMT and are commonly employed in diverse application areas, especially for capturing remote sensing (RS) images. At the same time, the interpretation of the captured RS images is also a crucial issue, which can be addressed by multi-label classification and Computational Linguistics based image captioning techniques. To achieve this, this paper presents an efficient low-complexity encoding technique with multi-label classification and image captioning for UAV-based RS images. The presented model primarily involves a low-complexity encoder using the Neighborhood Correlation Sequence (NCS) with a Burrows-Wheeler Transform (BWT) technique, called LCE-BWT, for encoding the RS images captured by the UAV. The application of NCS greatly reduces the computation complexity and requires fewer resources for image transmission. Secondly, a deep learning (DL) based shallow convolutional neural network for RS image classification (SCNN-RSIC) technique is presented to determine the multiple class labels of the RS image, showing the novelty of the work. Finally, the Computational Linguistics based Bidirectional Encoder Representations from Transformers (BERT) technique is applied for image captioning, to provide a proficient textual description of the RS image. The performance of the presented technique is tested using the UCM dataset. The simulation outcome implies that the presented model obtains effective compression performance, reconstructed image quality, classification results, and image captioning outcomes.
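To make the Burrows-Wheeler transform step of such a low-complexity encoder concrete, here is a naive reference implementation of the forward and inverse BWT. Real codecs build the transform with suffix arrays for efficiency; this O(n² log n) version is only meant to illustrate the transform itself and is not the paper's LCE-BWT code.

```python
def bwt(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Return the Burrows-Wheeler transform of `data`, using `sentinel` as a
    unique end marker (it must not occur in the input)."""
    assert sentinel not in data, "sentinel byte must not occur in the input"
    s = data + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))   # all cyclic rotations
    return bytes(rot[-1] for rot in rotations)                  # last column

def inverse_bwt(transformed: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Recover the original data by repeatedly prepending and re-sorting columns."""
    table = [b""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(bytes([c]) + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

sample = b"banana_band"
encoded = bwt(sample)
assert inverse_bwt(encoded) == sample
print(encoded)   # groups similar bytes together, which helps later entropy coding
```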
Image captioning involves two different major modalities (image and sentence) and converts a given image into language that adheres to visual semantics. Almost all methods first extract image features to reduce the difficulty of visual semantic embedding and then use a caption model to generate fluent sentences. The Convolutional Neural Network (CNN) is often used to extract image features in image captioning, and the use of object detection networks to extract region features has achieved great success. However, the region features retrieved by this method are object-level and do not pay attention to fine-grained details because of the detection model's limitations. We offer an approach to address this issue that generates captions more properly by fusing fine-grained features and region features. First, we extract fine-grained features using a panoptic segmentation algorithm. Second, we suggest two fusion methods and contrast their fusion outcomes. An X-Linear Attention Network (X-LAN) serves as the foundation for both fusion methods. According to experimental findings on the COCO dataset, the two-branch fusion approach is superior. It is important to note that on the COCO Karpathy test split, CIDEr is increased to 134.3% in comparison to the baseline, highlighting the potency and viability of our method.
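A two-branch fusion of this kind can be illustrated as attending over the object-level region features and the segmentation-derived fine-grained features separately, then merging the two context vectors. The sketch below shows only that generic pattern with standard attention; the paper itself builds its fusion on an X-linear attention backbone, so treat names and shapes here as assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Attend over region features and fine-grained (segmentation) features in
    separate branches, then concatenate and project the two context vectors."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, query, region_feats, fine_feats):
        # query: (B, 1, D) decoder state; region_feats: (B, R, D); fine_feats: (B, S, D).
        r_ctx, _ = self.region_attn(query, region_feats, region_feats)
        f_ctx, _ = self.fine_attn(query, fine_feats, fine_feats)
        return self.merge(torch.cat([r_ctx, f_ctx], dim=-1))   # fused context vector

q = torch.randn(2, 1, 512)
regions, segments = torch.randn(2, 36, 512), torch.randn(2, 100, 512)
print(TwoBranchFusion(512)(q, regions, segments).shape)   # torch.Size([2, 1, 512])
```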
The problem of producing a natural language description of an image to describe its visual content has gained more attention in natural language processing (NLP) and computer vision (CV). It can be driven by applications like image retrieval or indexing, virtual assistants, image understanding, and support of visually impaired people (VIP). Though VIPs use other senses, touch and hearing, for recognizing objects and events, the quality of life of those persons is lower than the standard level. Automatic image captioning generates captions that can be read aloud to VIPs, thereby conveying what is happening around them. This article introduces a Red Deer Optimization with Artificial Intelligence Enabled Image Captioning System (RDOAI-ICS) for visually impaired people. The presented RDOAI-ICS technique aids in generating image captions for VIPs. The presented RDOAI-ICS technique utilizes a Neural Architecture Search Network (NASNet) model to produce image representations. Besides, the RDOAI-ICS technique uses the radial basis function neural network (RBFNN) method to generate a textual description. To enhance the performance of the RDOAI-ICS method, parameter optimization is performed using the RDO algorithm for NASNet and the butterfly optimization algorithm (BOA) for the RBFNN model, showing the novelty of the work. The experimental evaluation of the RDOAI-ICS method was carried out on a benchmark dataset. The outcomes show the enhancements of the RDOAI-ICS method over other recent image captioning approaches.
Image captioning is an emerging topic of research in the domain of artificial intelligence (AI). It utilizes an integration of Computer Vision (CV) and Natural Language Processing (NLP) for generating image descriptions. It finds use in several application areas, namely recommendation in editing applications, utilization in virtual assistants, etc. The development of NLP and deep learning (DL) models is useful for deriving a bridge between visual details and textual semantics. In this view, this paper introduces an Oppositional Harris Hawks Optimization with Deep Learning based Image Captioning (OHHO-DLIC) technique. The OHHO-DLIC technique involves the design of distinct levels of pre-processing. Moreover, feature extraction from the images is carried out by the use of the EfficientNet model. Furthermore, image captioning is performed by a bidirectional long short-term memory (BiLSTM) model, comprising an encoder as well as a decoder. At last, an oppositional Harris Hawks optimization (OHHO) based hyperparameter tuning process is performed for effectively adjusting the hyperparameters of the EfficientNet and BiLSTM models. The experimental analysis of the OHHO-DLIC technique is carried out on the Flickr8k dataset, and a comprehensive comparative analysis highlighted its better performance over recent approaches.
Existing image captioning models usually build the relation between visual information and words to generate captions, which lack spatial information and object classes. To address the issue, we propose a novel Position-Class Awareness Transformer (PCAT) network which can serve as a bridge between the visual features and captions by embedding spatial information and awareness of object classes. In our proposal, we construct our PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE includes mapping the regions of objects to grids, calculating the relative distance among objects, and quantization. Meanwhile, we also improve the self-attention to adapt to the GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is employed to facilitate embedding features and refining the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and decoder, namely OCAE and OCAD, respectively. Finally, we apply GMPE, OCAE and OCAD to form various combinations and to complete the entire PCAT. We utilize the MSCOCO dataset to evaluate the performance of our method. The results demonstrate that PCAT outperforms the other competitive methods.
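The "map regions to grids, compute relative distances, quantize" recipe can be sketched in a few lines. The function below is a rough interpretation under stated assumptions (grid size, Euclidean distance, uniform bins); it is not the paper's exact GMPE definition.

```python
import torch

def grid_relative_positions(boxes: torch.Tensor, grid: int = 8, bins: int = 16):
    """Map each object box center to a cell of a `grid` x `grid` lattice, compute
    pairwise relative distances between cells, and quantize them into `bins`
    discrete levels that an attention layer could embed.

    boxes: (N, 4) normalized [x1, y1, x2, y2] coordinates in [0, 1].
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2                 # (N, 2) box centers
    cells = (centers * grid).clamp(max=grid - 1e-6).floor()     # grid coordinates
    rel = cells.unsqueeze(1) - cells.unsqueeze(0)               # (N, N, 2) cell offsets
    dist = rel.norm(dim=-1)                                     # Euclidean cell distance
    max_dist = (2 * (grid - 1) ** 2) ** 0.5                     # largest possible distance
    quantized = (dist / max_dist * (bins - 1)).round().long()   # (N, N) bucket ids
    return quantized

boxes = torch.tensor([[0.1, 0.1, 0.3, 0.4],
                      [0.5, 0.5, 0.9, 0.9],
                      [0.2, 0.6, 0.4, 0.8]])
print(grid_relative_positions(boxes))   # small integers usable as embedding indices
```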
One of the issues in Computer Vision is the automatic development of descriptions for images, sometimes known as image captioning. Deep Learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extractor subsystem followed by a caption generation lingual subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to find the most expressive extracted feature vector of the image. For the caption generation lingual subsystem, this paper tested three different pre-trained language embedding models: GloVe (Global Vectors for Word Representation), BERT (Bidirectional Encoder Representations from Transformers), and TaCL (Token-aware Contrastive Learning), to select the most accurate pre-trained language embedding model. Our experiments showed that an image captioning system that uses a concatenation of the two Transformer-based models SWIN (Shifted Window) and PVT (Pyramid Vision Transformer) as the image feature extractor, combined with the TaCL language embedding model, gives the best results among the tested combinations.
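The "concatenate a pair of vision models" step amounts to pooling one feature vector from each backbone and joining them into a single image descriptor for the lingual subsystem. The sketch below shows the pattern with small stand-in extractors; in the paper the winning pair is SWIN and PVT, which would simply replace the placeholders.

```python
import torch
import torch.nn as nn

def concat_image_features(extractor_a: nn.Module, extractor_b: nn.Module,
                          images: torch.Tensor) -> torch.Tensor:
    """Pool one feature vector from each backbone and concatenate them into a
    single, richer image descriptor for the caption-generation subsystem."""
    extractor_a.eval()
    extractor_b.eval()
    with torch.no_grad():
        fa = extractor_a(images).flatten(1)   # (B, D_a)
        fb = extractor_b(images).flatten(1)   # (B, D_b)
    return torch.cat([fa, fb], dim=-1)        # (B, D_a + D_b)

# Tiny CNN placeholders standing in for the SWIN / PVT backbones.
toy_a = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.AdaptiveAvgPool2d(1))
toy_b = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.AdaptiveAvgPool2d(1))
feats = concat_image_features(toy_a, toy_b, torch.randn(2, 3, 224, 224))
print(feats.shape)   # torch.Size([2, 24])
```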
Recent advances in deep learning research have shown remarkable achievements across many tasks in computer vision (CV) and natural language processing (NLP). At the intersection of CV and NLP is the problem of image captioning, where the related models' robustness against adversarial attacks has not been well studied. This paper presents a novel adversarial attack strategy, attention-based image captioning attack (AICAttack), designed to attack image captioning models through subtle perturbations to images. Operating within a black-box attack scenario, our algorithm requires no access to the target model's architecture, parameters, or gradient information. We introduce an attention-based candidate selection mechanism that identifies the optimal pixels for attack, followed by a customized differential evolution method to optimize the perturbations of the pixels' RGB values. We demonstrate AICAttack's effectiveness through extensive experiments on benchmark datasets against multiple victim models. The experimental results demonstrate that our method outperforms current leading-edge techniques by achieving consistently higher attack success rates.
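To make the black-box search concrete, here is a toy differential-evolution loop over RGB perturbations of a few chosen pixels (the pixels would come from an attention-based selector). The objective function, bounds, and the simplified mutation-only scheme (no crossover) are assumptions for illustration; this is not the AICAttack code.

```python
import numpy as np

def de_pixel_attack(score_fn, image: np.ndarray, pixels, pop: int = 20,
                    iters: int = 30, eps: float = 0.1, seed: int = 0):
    """Differential-evolution-style search over RGB offsets of selected pixels.
    `score_fn` is a black-box objective (e.g., drop in caption quality) to maximize."""
    rng = np.random.default_rng(seed)
    dim = len(pixels) * 3                                   # one RGB offset per pixel
    population = rng.uniform(-eps, eps, size=(pop, dim))

    def apply(delta):
        perturbed = image.copy()
        for i, (y, x) in enumerate(pixels):
            perturbed[y, x] = np.clip(perturbed[y, x] + delta[3 * i:3 * i + 3], 0.0, 1.0)
        return perturbed

    fitness = np.array([score_fn(apply(d)) for d in population])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = population[rng.choice(pop, 3, replace=False)]
            trial = np.clip(a + 0.5 * (b - c), -eps, eps)   # DE/rand/1 mutation (crossover omitted)
            f = score_fn(apply(trial))
            if f > fitness[i]:                              # keep the better candidate
                population[i], fitness[i] = trial, f
    best = population[fitness.argmax()]
    return apply(best), fitness.max()

# Dummy usage: the "attack objective" here is just the mean pixel change.
img = np.random.rand(32, 32, 3)
adv, score = de_pixel_attack(lambda x: float(np.abs(x - img).mean()), img,
                             pixels=[(5, 5), (10, 20), (15, 7)])
print(score)
```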
Image captioning is a cross-modal task that needs to automatically generate coherent natural sentences to describe the image contents. Due to the large gap between vision and language modalities, most of the existing methods have the problem of inaccurate semantic matching between images and generated captions. To solve the problem, this paper proposes a novel multi-level similarity-guided semantic matching method for image captioning, which can fuse local and global semantic similarities to learn the latent semantic correlation between images and generated captions. Specifically, we extract the semantic units containing fine-grained semantic information of images and generated captions, respectively. Based on the comparison of the semantic units, we design a local semantic similarity evaluation mechanism. Meanwhile, we employ the CIDEr score to characterize the global semantic similarity. The local and global two-level similarities are finally fused using reinforcement learning theory, to guide the model optimization to obtain better semantic matching. The quantitative and qualitative experiments on the large-scale MSCOCO dataset illustrate the superiority of the proposed method, which can achieve fine-grained semantic matching of images and generated captions.
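Fusing a local similarity score with the global CIDEr score inside a reinforcement-learning objective can be sketched as a mixed reward used in a self-critical (SCST-style) loss, as below. The weighting scheme and the choice of greedy-decoding baseline are assumptions, not the paper's exact formulation.

```python
import torch

def mixed_reward_loss(log_probs, local_sim, cider, baseline_local, baseline_cider,
                      alpha: float = 0.5):
    """Fuse a local semantic-similarity score with a global CIDEr score into one
    reward; the advantage is the sampled caption's mixed reward minus a baseline
    (e.g., the greedy caption's), REINFORCE/self-critical style.

    log_probs: (B,) summed log-probabilities of the sampled captions.
    All other inputs are per-sample reward tensors of shape (B,).
    """
    reward = alpha * local_sim + (1 - alpha) * cider
    baseline = alpha * baseline_local + (1 - alpha) * baseline_cider
    advantage = (reward - baseline).detach()       # reward signals are not differentiated
    return -(advantage * log_probs).mean()         # policy-gradient loss to minimize

loss = mixed_reward_loss(log_probs=torch.randn(4, requires_grad=True),
                         local_sim=torch.rand(4), cider=torch.rand(4) * 2,
                         baseline_local=torch.rand(4), baseline_cider=torch.rand(4) * 2)
loss.backward()
print(float(loss))
```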
We propose a collaborative learning method to solve the natural image captioning problem. Numerous existing methods use pretrained image classification CNNs to obtain feature representations for image caption generation, which ignores the gap in image feature representations between different computer vision tasks. To address this problem, our method aims to utilize the similarity between the image captioning and pix-to-pix inverting tasks to ease the feature representation gap. Specifically, our framework consists of two modules: 1) the pix2pix module (P2PM), which has a shared learning feature extractor to extract feature representations and a U-net architecture to encode the image into a latent code and then decode it back to the original image; 2) the natural language generation module (NLGM), which generates descriptions from the feature representations extracted by P2PM. Consequently, the feature representations and generated image captions are improved during the collaborative learning process. The experimental results on the MSCOCO 2017 dataset prove the effectiveness of our approach compared to other comparison methods.
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlation from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and a potential supplement or even replacement of radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. Results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that solely rely on images.
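The "score to rich text" prompting step can be illustrated with a simple template that turns an annotated quality score into a sentence a captioning VLM can be trained on, plus the kind of verbal request later sent to an LLM. The five-level wording, identifiers, and prompt text below are assumptions for illustration, not the IQAGPT templates.

```python
# Hypothetical prompt-template step: convert an integer CT quality score into a
# semantically rich description, then frame a request for an LLM.
QUALITY_LEVELS = {
    0: "non-diagnostic, with severe noise and artifacts obscuring anatomy",
    1: "poor, with heavy noise that limits evaluation of fine structures",
    2: "acceptable, with moderate noise but preserved major structures",
    3: "good, with mild noise and clearly visible anatomical detail",
    4: "excellent, with minimal noise and sharp anatomical detail",
}

def score_to_description(score: int, slice_id: str) -> str:
    """Turn an annotated quality score into a textual quality description."""
    level = QUALITY_LEVELS[score]
    return (f"CT slice {slice_id}: the image quality is rated {score} out of 4; "
            f"overall it is {level}.")

def llm_request(description: str) -> str:
    """Example of the verbal request a user could send to ChatGPT afterwards."""
    return ("Based on the following quality description, return a numeric score "
            "from 0 to 4 and a short radiological quality report.\n" + description)

print(llm_request(score_to_description(3, "case_0042_slice_17")))
```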
基金supported by the National Natural Science Foundation of China(Nos.U22A2034,62177047)High Caliber Foreign Experts Introduction Plan funded by MOST,and Central South University Research Programme of Advanced Interdisciplinary Studies(No.2023QYJC020).
文摘Image captioning has seen significant research efforts over the last decade.The goal is to generate meaningful semantic sentences that describe visual content depicted in photographs and are syntactically accurate.Many real-world applications rely on image captioning,such as helping people with visual impairments to see their surroundings.To formulate a coherent and relevant textual description,computer vision techniques are utilized to comprehend the visual content within an image,followed by natural language processing methods.Numerous approaches and models have been developed to deal with this multifaceted problem.Several models prove to be stateof-the-art solutions in this field.This work offers an exclusive perspective emphasizing the most critical strategies and techniques for enhancing image caption generation.Rather than reviewing all previous image captioning work,we analyze various techniques that significantly improve image caption generation and achieve significant performance improvements,including encompassing image captioning with visual attention methods,exploring semantic information types in captions,and employing multi-caption generation techniques.Further,advancements such as neural architecture search,few-shot learning,multi-phase learning,and cross-modal embedding within image caption networks are examined for their transformative effects.The comprehensive quantitative analysis conducted in this study identifies cutting-edgemethodologies and sheds light on their profound impact,driving forward the forefront of image captioning technology.
基金supported by the Natural Science Foundation of China(62473105,62172118)Nature Science Key Foundation of Guangxi(2021GXNSFDA196002)+1 种基金in part by the Guangxi Key Laboratory of Image and Graphic Intelligent Processing under Grants(GIIP2302,GIIP2303,GIIP2304)Innovation Project of Guang Xi Graduate Education(2024YCXB09,2024YCXS039).
文摘Existing Transformer-based image captioning models typically rely on the self-attention mechanism to capture long-range dependencies,which effectively extracts and leverages the global correlation of image features.However,these models still face challenges in effectively capturing local associations.Moreover,since the encoder extracts global and local association features that focus on different semantic information,semantic noise may occur during the decoding stage.To address these issues,we propose the Local Relationship Enhanced Gated Transformer(LREGT).In the encoder part,we introduce the Local Relationship Enhanced Encoder(LREE),whose core component is the Local Relationship Enhanced Module(LREM).LREM consists of two novel designs:the Local Correlation Perception Module(LCPM)and the Local-Global Fusion Module(LGFM),which are beneficial for generating a comprehensive feature representation that integrates both global and local information.In the decoder part,we propose the Dual-level Multi-branch Gated Decoder(DMGD).It first creates multiple decoding branches to generate multi-perspective contextual feature representations.Subsequently,it employs the Dual-Level Gating Mechanism(DLGM)to model the multi-level relationships of these multi-perspective contextual features,enhancing their fine-grained semantics and intrinsic relationship representations.This ultimately leads to the generation of high-quality and semantically rich image captions.Experiments on the standard MSCOCO dataset demonstrate that LREGT achieves state-of-the-art performance,with a CIDEr score of 140.8 and BLEU-4 score of 41.3,significantly outperforming existing mainstream methods.These results highlight LREGT’s superiority in capturing complex visual relationships and resolving semantic noise during decoding.
文摘Background The annotation of fashion images is a significantly important task in the fashion industry as well as social media and e-commerce.However,owing to the complexity and diversity of fashion images,this task entails multiple challenges,including the lack of fine-grained captions and confounders caused by dataset bias.Specifically,confounders often cause models to learn spurious correlations,thereby reducing their generalization capabilities.Method In this work,we propose the Deconfounded Fashion Image Captioning(DFIC)framework,which first uses multimodal retrieval to enrich the predicted captions of clothing,and then constructs a detailed causal graph using causal inference in the decoder to perform deconfounding.Multimodal retrieval is used to obtain semantic words related to image features,which are input into the decoder as prompt words to enrich sentence descriptions.In the decoder,causal inference is applied to disentangle visual and semantic features while concurrently eliminating visual and language confounding.Results Overall,our method can not only effectively enrich the captions of target images,but also greatly reduce confounders caused by the dataset.To verify the effectiveness of the proposed framework,the model was experimentally verified using the FACAD dataset.
基金supported by the National Natural Science Foundation of China(Nos.U22A2034,62177047)High Caliber Foreign Experts Introduction Plan funded by MOST,and Central South University Research Programme of Advanced Interdisciplinary Studies(No.2023QYJC020).
文摘Image captioning has gained increasing attention in recent years.Visual characteristics found in input images play a crucial role in generating high-quality captions.Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image,improving the effectiveness of identifying relevant image regions at each step of caption generation.However,providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features.Consequently,this leads to enhanced captioning network performance.In light of this,we present an image captioning framework that efficiently exploits the extracted representations of the image.Our framework comprises three key components:the Visual Feature Detector module(VFD),the Visual Feature Visual Attention module(VFVA),and the language model.The VFD module is responsible for detecting a subset of the most pertinent features from the local visual features,creating an updated visual features matrix.Subsequently,the VFVA directs its attention to the visual features matrix generated by the VFD,resulting in an updated context vector employed by the language model to generate an informative description.Integrating the VFD and VFVA modules introduces an additional layer of processing for the visual features,thereby contributing to enhancing the image captioning model’s performance.Using the MS-COCO dataset,our experiments show that the proposed framework competes well with state-of-the-art methods,effectively leveraging visual representations to improve performance.The implementation code can be found here:https://github.com/althobhani/VFDICM(accessed on 30 July 2024).
基金funded by Researchers Supporting Project number(RSPD2024R698),King Saud University,Riyadh,Saudi Arabia.
文摘The process of generating descriptive captions for images has witnessed significant advancements in last years,owing to the progress in deep learning techniques.Despite significant advancements,the task of thoroughly grasping image content and producing coherent,contextually relevant captions continues to pose a substantial challenge.In this paper,we introduce a novel multimodal method for image captioning by integrating three powerful deep learning architectures:YOLOv8(You Only Look Once)for robust object detection,EfficientNetB7 for efficient feature extraction,and Transformers for effective sequence modeling.Our proposed model combines the strengths of YOLOv8 in detecting objects,the superior feature representation capabilities of EfficientNetB7,and the contextual understanding and sequential generation abilities of Transformers.We conduct extensive experiments on standard benchmark datasets to evaluate the effectiveness of our approach,demonstrating its ability to generate informative and semantically rich captions for diverse images.The experimental results showcase the synergistic benefits of integrating YOLOv8,EfficientNetB7,and Transformers in advancing the state-of-the-art in image captioning tasks.The proposed multimodal approach has yielded impressive outcomes,generating informative and semantically rich captions for a diverse range of images.By combining the strengths of YOLOv8,EfficientNetB7,and Transformers,the model has achieved state-of-the-art results in image captioning tasks.The significance of this approach lies in its ability to address the challenging task of generating coherent and contextually relevant captions while achieving a comprehensive understanding of image content.The integration of three powerful deep learning architectures demonstrates the synergistic benefits of multimodal fusion in advancing the state-of-the-art in image captioning.Furthermore,this approach has a profound impact on the field,opening up new avenues for research in multimodal deep learning and paving the way for more sophisticated and context-aware image captioning systems.These systems have the potential to make significant contributions to various fields,encompassing human-computer interaction,computer vision and natural language processing.
基金supported by Beijing Natural Science Foundation of China(L201023)the Natural Science Foundation of China(62076030)。
文摘Image captioning refers to automatic generation of descriptive texts according to the visual content of images.It is a technique integrating multiple disciplines including the computer vision(CV),natural language processing(NLP)and artificial intelligence.In recent years,substantial research efforts have been devoted to generate image caption with impressive progress.To summarize the recent advances in image captioning,we present a comprehensive review on image captioning,covering both traditional methods and recent deep learning-based techniques.Specifically,we first briefly review the early traditional works based on the retrieval and template.Then deep learning-based image captioning researches are focused,which is categorized into the encoder-decoder framework,attention mechanism and training strategies on the basis of model structures and training manners for a detailed introduction.After that,we summarize the publicly available datasets,evaluation metrics and those proposed for specific requirements,and then compare the state of the art methods on the MS COCO dataset.Finally,we provide some discussions on open challenges and future research directions.
基金This work was supported in part by the National Natural Science Foundation of China under Grant No.61977018the Deanship of Scientific Research at King Saud University,Riyadh,Saudi Arabia for funding this work through research Group No.RG-1438-070in part by the Research Foundation of Education Bureau of Hunan Province of China under Grant 16B006.
文摘Image captioning aims to generate a corresponding description of an image.In recent years,neural encoder-decodermodels have been the dominant approaches,in which the Convolutional Neural Network(CNN)and Long Short TermMemory(LSTM)are used to translate an image into a natural language description.Among these approaches,the visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.However,most conventional visual attention mechanisms are based on high-level image features,ignoring the effects of other image features,and giving insufficient consideration to the relative positions between image features.In this work,we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms for the above problems.The image-feature attention firstly extracts multi-level features by using Feature Pyramid Network(FPN),then utilizes the scaled-dot-product to fuse these features,which enables our model to detect objects of different scales in the image more effectivelywithout increasing parameters.In the position-aware attentionmechanism,the relative positions between image features are obtained at first,afterwards the relative positions are incorporated into the original image features to generate captions more accurately.Experiments are carried out on the MSCOCO dataset and our approach achieves competitive BLEU-4,METEOR,ROUGE-L,CIDEr scores compared with some state-of-the-art approaches,demonstrating the effectiveness of our approach.
基金supported by the National Natural Science Foundation of China (61702528,61806212)。
文摘In the field of satellite imagery, remote sensing image captioning(RSIC) is a hot topic with the challenge of overfitting and difficulty of image and text alignment. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset DIOR-Captions is built for augmenting object detection in optical remote(DIOR) sensing images dataset with manually annotated Chinese and English contents. Second, a Vision-Language aligning model with Cross-modal Attention(VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a crossmodal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-toend Chinese captions generation by using the pre-training language model of Chinese. The experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm is more descriptive and informative than existing algorithms in producing captions.
基金Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2022R161)PrincessNourah bint Abdulrahman University,Riyadh,Saudi Arabia.The authors would like to thank the|Deanship of Scientific Research at Umm Al-Qura University|for supporting this work by Grant Code:(22UQU4310373DSR33).
文摘The recent developments in Multimedia Internet of Things(MIoT)devices,empowered with Natural Language Processing(NLP)model,seem to be a promising future of smart devices.It plays an important role in industrial models such as speech understanding,emotion detection,home automation,and so on.If an image needs to be captioned,then the objects in that image,its actions and connections,and any silent feature that remains under-projected or missing from the images should be identified.The aim of the image captioning process is to generate a caption for image.In next step,the image should be provided with one of the most significant and detailed descriptions that is syntactically as well as semantically correct.In this scenario,computer vision model is used to identify the objects and NLP approaches are followed to describe the image.The current study develops aNatural Language Processing with Optimal Deep Learning Enabled Intelligent Image Captioning System(NLPODL-IICS).The aim of the presented NLPODL-IICS model is to produce a proper description for input image.To attain this,the proposed NLPODL-IICS follows two stages such as encoding and decoding processes.Initially,at the encoding side,the proposed NLPODL-IICS model makes use of Hunger Games Search(HGS)with Neural Search Architecture Network(NASNet)model.This model represents the input data appropriately by inserting it into a predefined length vector.Besides,during decoding phase,Chimp Optimization Algorithm(COA)with deeper Long Short Term Memory(LSTM)approach is followed to concatenate the description sentences 4436 CMC,2023,vol.74,no.2 produced by the method.The application of HGS and COA algorithms helps in accomplishing proper parameter tuning for NASNet and LSTM models respectively.The proposed NLPODL-IICS model was experimentally validated with the help of two benchmark datasets.Awidespread comparative analysis confirmed the superior performance of NLPODL-IICS model over other models.
文摘Image captioning is an emerging field in machine learning.It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image.Image captioning requires a complex machine learning process as it involves two sub models:a vision sub-model for extracting object features and a language sub-model that use the extracted features to generate meaningful captions.Attention-based vision transformers models have a great impact in vision field recently.In this paper,we studied the effect of using the vision transformers on the image captioning process by evaluating the use of four different vision transformer models for the vision sub-models of the image captioning The first vision transformers used is DINO(self-distillation with no labels).The second is PVT(Pyramid Vision Transformer)which is a vision transformer that is not using convolutional layers.The third is XCIT(cross-Covariance Image Transformer)which changes the operation in self-attention by focusing on feature dimension instead of token dimensions.The last one is SWIN(Shifted windows),it is a vision transformer which,unlike the other transformers,uses shifted-window in splitting the image.For a deeper evaluation,the four mentioned vision transformers have been tested with their different versions and different configuration,we evaluate the use of DINO model with five different backbones,PVT with two versions:PVT_v1and PVT_v2,one model of XCIT,SWIN transformer.The results show the high effectiveness of using SWIN-transformer within the proposed image captioning model with regard to the other models.
基金The authors extend their appreciation to the Deputyship for Research&Innovation,Ministry of Education in Saudi Arabia for funding this research work through the Project Number(IFPIP-941-137-1442)and King Abdulaziz University,DSR,Jeddah,Saudi Arabia.
文摘Due to the advanced development in the multimedia-on-demandtraffic in different forms of audio, video, and images, has extremely movedon the vision of the Internet of Things (IoT) from scalar to Internet ofMultimedia Things (IoMT). Since Unmanned Aerial Vehicles (UAVs) generates a massive quantity of the multimedia data, it becomes a part of IoMT,which are commonly employed in diverse application areas, especially forcapturing remote sensing (RS) images. At the same time, the interpretationof the captured RS image also plays a crucial issue, which can be addressedby the multi-label classification and Computational Linguistics based imagecaptioning techniques. To achieve this, this paper presents an efficient lowcomplexity encoding technique with multi-label classification and image captioning for UAV based RS images. The presented model primarily involves thelow complexity encoder using the Neighborhood Correlation Sequence (NCS)with a burrows wheeler transform (BWT) technique called LCE-BWT forencoding the RS images captured by the UAV. The application of NCS greatlyreduces the computation complexity and requires fewer resources for imagetransmission. Secondly, deep learning (DL) based shallow convolutional neural network for RS image classification (SCNN-RSIC) technique is presentedto determine the multiple class labels of the RS image, shows the novelty ofthe work. Finally, the Computational Linguistics based Bidirectional EncoderRepresentations from Transformers (BERT) technique is applied for imagecaptioning, to provide a proficient textual description of the RS image. Theperformance of the presented technique is tested using the UCM dataset. Thesimulation outcome implied that the presented model has obtained effectivecompression performance, reconstructed image quality, classification results,and image captioning outcome.
基金supported in part by the National Natural Science Foundation of China(NSFC)under Grant 6150140in part by the Youth Innovation Project(21032158-Y)of Zhejiang Sci-Tech University.
文摘Image captioning involves two different major modalities(image and sentence)that convert a given image into a language that adheres to visual semantics.Almost all methods first extract image features to reduce the difficulty of visual semantic embedding and then use the caption model to generate fluent sentences.The Convolutional Neural Network(CNN)is often used to extract image features in image captioning,and the use of object detection networks to extract region features has achieved great success.However,the region features retrieved by this method are object-level and do not pay attention to fine-grained details because of the detection model’s limitation.We offer an approach to address this issue that more properly generates captions by fusing fine-grained features and region features.First,we extract fine-grained features using a panoramic segmentation algorithm.Second,we suggest two fusion methods and contrast their fusion outcomes.An X-linear Attention Network(X-LAN)serves as the foundation for both fusion methods.According to experimental findings on the COCO dataset,the two-branch fusion approach is superior.It is important to note that on the COCO Karpathy test split,CIDEr is increased up to 134.3%in comparison to the baseline,highlighting the potency and viability of our method.
基金The authors extend their appreciation to the King Salman center for Disability Research for funding this work through Research Group no KSRG-2022-017.
文摘The problem of producing a natural language description of an image for describing the visual content has gained more attention in natural language processing(NLP)and computer vision(CV).It can be driven by applications like image retrieval or indexing,virtual assistants,image understanding,and support of visually impaired people(VIP).Though the VIP uses other senses,touch and hearing,for recognizing objects and events,the quality of life of those persons is lower than the standard level.Automatic Image captioning generates captions that will be read loudly to the VIP,thereby realizing matters happening around them.This article introduces a Red Deer Optimization with Artificial Intelligence Enabled Image Captioning System(RDOAI-ICS)for Visually Impaired People.The presented RDOAI-ICS technique aids in generating image captions for VIPs.The presented RDOAIICS technique utilizes a neural architectural search network(NASNet)model to produce image representations.Besides,the RDOAI-ICS technique uses the radial basis function neural network(RBFNN)method to generate a textual description.To enhance the performance of the RDOAI-ICS method,the parameter optimization process takes place using the RDO algorithm for NasNet and the butterfly optimization algorithm(BOA)for the RBFNN model,showing the novelty of the work.The experimental evaluation of the RDOAI-ICS method can be tested using a benchmark dataset.The outcomes show the enhancements of the RDOAI-ICS method over other recent Image captioning approaches.
Funding: Supported by the Soonchunhyang University Research Fund and the University Innovation Support Project.
Abstract: Image captioning is an emergent topic of research in the domain of artificial intelligence (AI). It integrates Computer Vision (CV) and Natural Language Processing (NLP) to generate image descriptions. It finds use in several application areas, namely recommendations in editing applications, virtual assistance, and so on. The development of NLP and deep learning (DL) models is useful for bridging visual details and textual semantics. In this view, this paper introduces an Oppositional Harris Hawks Optimization with Deep Learning based Image Captioning (OHHO-DLIC) technique. The OHHO-DLIC technique involves the design of distinct levels of pre-processing. Moreover, feature extraction from the images is carried out using the EfficientNet model. Furthermore, image captioning is performed by a bidirectional long short-term memory (BiLSTM) model comprising an encoder as well as a decoder. At last, an oppositional Harris Hawks optimization (OHHO) based hyperparameter tuning process is performed to effectively adjust the hyperparameters of the EfficientNet and BiLSTM models. The experimental analysis of the OHHO-DLIC technique is carried out on the Flickr8k dataset, and a comprehensive comparative analysis highlights its better performance over recent approaches.
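A rough PyTorch sketch of the captioning backbone described here, an image feature vector (e.g., from EfficientNet) conditioning a BiLSTM over caption tokens, is given below. All hyperparameters are placeholders rather than OHHO-tuned values, and the real pipeline's pre-processing stages are omitted.

```python
import torch
import torch.nn as nn


class BiLSTMCaptioner(nn.Module):
    """Sketch: a pre-extracted image feature vector is prepended to the
    caption-token embeddings and fed through a BiLSTM; a linear head
    produces per-position vocabulary logits."""

    def __init__(self, feat_dim=1280, vocab_size=10000, embed_dim=256, hidden=512):
        super().__init__()
        self.init_proj = nn.Linear(feat_dim, embed_dim)   # image -> first token slot
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, feat_dim), captions: (B, T) token ids
        img_token = self.init_proj(image_feats).unsqueeze(1)      # (B, 1, embed_dim)
        seq = torch.cat([img_token, self.embed(captions)], dim=1)
        hidden_states, _ = self.bilstm(seq)
        return self.out(hidden_states[:, 1:, :])                  # (B, T, vocab_size)


if __name__ == "__main__":
    model = BiLSTMCaptioner()
    logits = model(torch.randn(2, 1280), torch.randint(0, 10000, (2, 15)))
    print(logits.shape)  # torch.Size([2, 15, 10000])
```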
Funding: Supported by the National Key Research and Development Program of China [No. 2021YFB2206200].
Abstract: Existing image captioning models usually build relations between visual information and words to generate captions, which lack spatial information and object classes. To address this issue, we propose a novel Position-Class Awareness Transformer (PCAT) network, which serves as a bridge between visual features and captions by embedding spatial information and awareness of object classes. In our proposal, we construct the PCAT network by proposing a novel Grid Mapping Position Encoding (GMPE) method and refining the encoder-decoder framework. First, GMPE maps the regions of objects to grids, calculates the relative distances among objects, and quantizes them. Meanwhile, we also improve the self-attention to adapt to GMPE. Then, we propose a Classes Semantic Quantization strategy to extract semantic information from the object classes, which is employed to facilitate embedding features and refining the encoder-decoder framework. To capture the interaction between multi-modal features, we propose Object Classes Awareness (OCA) to refine the encoder and decoder, namely OCAE and OCAD, respectively. Finally, we apply GMPE, OCAE, and OCAD in various combinations to form the complete PCAT. We use the MSCOCO dataset to evaluate the performance of our method. The results demonstrate that PCAT outperforms the other competitive methods.
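The following sketch illustrates the GMPE idea as described: object boxes are mapped to grid cells, pairwise relative distances between cells are computed, and the distances are quantized into discrete bins that could index a learned relative-position table. The grid size, bin count, and Euclidean distance metric are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def grid_mapping_position_encoding(boxes, grid_size=16, n_bins=32):
    """Map each object box to a grid cell, compute pairwise relative
    distances between cells, and quantise them into discrete buckets."""
    # boxes: (N, 4) normalised [x1, y1, x2, y2] in [0, 1]
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)
    cells = (centers * grid_size).long().clamp(0, grid_size - 1)   # (N, 2) grid coords
    rel = cells[:, None, :] - cells[None, :, :]                    # (N, N, 2) offsets
    dist = rel.float().norm(dim=-1)                                # pairwise distances
    max_dist = (2 ** 0.5) * grid_size
    buckets = (dist / max_dist * (n_bins - 1)).round().long()      # quantised (N, N)
    return buckets          # could index a learned relative-position embedding table


if __name__ == "__main__":
    boxes = torch.tensor([[0.1, 0.1, 0.3, 0.4],
                          [0.5, 0.2, 0.9, 0.8],
                          [0.2, 0.6, 0.4, 0.9]])
    print(grid_mapping_position_encoding(boxes).shape)  # torch.Size([3, 3])
```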
Abstract: One of the issues in computer vision is the automatic generation of descriptions for images, also known as image captioning. Deep learning techniques have made significant progress in this area. The typical architecture of image captioning systems consists mainly of an image feature extraction subsystem followed by a caption-generating language subsystem. This paper aims to find optimized models for these two subsystems. For the image feature extraction subsystem, the research tested eight different concatenations of pairs of vision models to identify the most expressive extracted feature vector of the image. For the caption-generating language subsystem, this paper tested three different pre-trained language embedding models: GloVe (Global Vectors for Word Representation), BERT (Bidirectional Encoder Representations from Transformers), and TaCL (Token-aware Contrastive Learning), to select the most accurate pre-trained language embedding model. Our experiments showed that an image captioning system using a concatenation of the two Transformer-based models SWIN (Shifted Window) and PVT (Pyramid Vision Transformer) as the image feature extractor, combined with the TaCL language embedding model, gives the best result among the tested combinations.
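A minimal sketch of the winning design's feature-extraction step, concatenating the pooled outputs of two vision backbones, appears below. Stand-in modules replace the real SWIN and PVT models so the example stays self-contained; the feature dimensions are arbitrary.

```python
import torch
import torch.nn as nn


class DualBackboneFeatures(nn.Module):
    """Sketch of concatenating two vision backbones' global feature vectors.
    The real system pairs SWIN and PVT; here any modules that map an image
    batch to (B, D) vectors can be plugged in."""

    def __init__(self, backbone_a: nn.Module, backbone_b: nn.Module):
        super().__init__()
        self.backbone_a = backbone_a   # e.g., a SWIN model returning (B, D_a)
        self.backbone_b = backbone_b   # e.g., a PVT model returning (B, D_b)

    def forward(self, images):
        feats_a = self.backbone_a(images)
        feats_b = self.backbone_b(images)
        return torch.cat([feats_a, feats_b], dim=-1)   # (B, D_a + D_b)


if __name__ == "__main__":
    # Stand-in "backbones": global-average-pool + linear heads.
    fake_a = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 768))
    fake_b = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 512))
    model = DualBackboneFeatures(fake_a, fake_b)
    print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1280])
```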
Funding: Funding enabled and organized by CAUL and its Member Institutions.
Abstract: Recent advances in deep learning research have shown remarkable achievements across many tasks in computer vision (CV) and natural language processing (NLP). At the intersection of CV and NLP is the problem of image captioning, where the related models' robustness against adversarial attacks has not been well studied. This paper presents a novel adversarial attack strategy, Attention-based Image Captioning Attack (AICAttack), designed to attack image captioning models through subtle perturbations to images. Operating within a black-box attack scenario, our algorithm requires no access to the target model's architecture, parameters, or gradient information. We introduce an attention-based candidate selection mechanism that identifies the optimal pixels for attack, followed by a customized differential evolution method to optimize the perturbations of the pixels' RGB values. We demonstrate AICAttack's effectiveness through extensive experiments on benchmark datasets against multiple victim models. The experimental results demonstrate that our method outperforms current leading-edge techniques by achieving consistently higher attack success rates.
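The sketch below illustrates the black-box optimization step with a toy DE/rand/1 differential-evolution loop over RGB offsets at a fixed set of candidate pixels. In AICAttack the candidates would come from attention scores and the fitness would measure caption degradation of the victim model; here the dummy fitness function, the perturbation bounds, the population size, and the absence of crossover are all simplifying assumptions.

```python
import numpy as np


def apply_offsets(image, pixel_coords, offsets):
    """Add RGB offsets at the chosen pixels and clip to valid range."""
    adv = image.astype(np.float32).copy()
    for (y, x), off in zip(pixel_coords, offsets):
        adv[y, x] = np.clip(adv[y, x] + off, 0, 255)
    return adv.astype(np.uint8)


def de_pixel_attack(image, pixel_coords, fitness, pop_size=20, steps=30, F=0.5, seed=0):
    """Toy differential evolution over per-pixel RGB perturbations.
    `fitness` is a black-box score to maximise (caption degradation in the
    real attack); selection is greedy, mutation is DE/rand/1."""
    rng = np.random.default_rng(seed)
    n = len(pixel_coords)
    pop = rng.uniform(-32, 32, size=(pop_size, n, 3))          # candidate RGB offsets
    scores = np.array([fitness(apply_offsets(image, pixel_coords, ind)) for ind in pop])
    for _ in range(steps):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            trial = np.clip(a + F * (b - c), -32, 32)          # mutate
            s = fitness(apply_offsets(image, pixel_coords, trial))
            if s > scores[i]:                                  # keep better candidate
                pop[i], scores[i] = trial, s
    return apply_offsets(image, pixel_coords, pop[scores.argmax()])


if __name__ == "__main__":
    img = np.zeros((32, 32, 3), dtype=np.uint8)
    coords = [(4, 4), (10, 20), (25, 7)]
    # Dummy fitness: total perturbation energy stands in for caption degradation.
    adv = de_pixel_attack(img, coords, fitness=lambda x: float(x.sum()))
    print(adv.shape, int(adv.sum()))
```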
Funding: Supported in part by the National Natural Science Foundation of China (62002257) and the China Postdoctoral Science Foundation (2021M692395).
Abstract: Image captioning is a cross-modal task that automatically generates coherent natural sentences to describe image contents. Due to the large gap between the vision and language modalities, most existing methods suffer from inaccurate semantic matching between images and generated captions. To solve this problem, this paper proposes a novel multi-level similarity-guided semantic matching method for image captioning, which fuses local and global semantic similarities to learn the latent semantic correlation between images and generated captions. Specifically, we extract semantic units containing fine-grained semantic information from the images and the generated captions, respectively. Based on a comparison of these semantic units, we design a local semantic similarity evaluation mechanism. Meanwhile, we employ the CIDEr score to characterize the global semantic similarity. The local and global similarities are finally fused using reinforcement learning to guide model optimization toward better semantic matching. Quantitative and qualitative experiments on the large-scale MSCOCO dataset illustrate the superiority of the proposed method, which achieves fine-grained semantic matching between images and generated captions.
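To illustrate how the two similarity levels could enter a reinforcement-learning objective, the sketch below fuses a local similarity score with CIDEr into a single reward and plugs it into a self-critical policy-gradient loss. The linear weighting and the SCST-style greedy baseline are assumptions for illustration, not the paper's exact fusion rule.

```python
import torch


def combined_reward(local_sim, cider_score, alpha=0.5):
    """Fuse local semantic similarity with the global CIDEr score into one
    scalar reward per caption; the weighting is an illustrative choice."""
    return alpha * local_sim + (1.0 - alpha) * cider_score


def self_critical_loss(log_probs, sampled_reward, greedy_reward):
    """Self-critical policy gradient: the greedy decode's reward is the
    baseline, so sampled captions scoring above it are reinforced."""
    advantage = sampled_reward - greedy_reward                 # (B,)
    return -(advantage * log_probs.sum(dim=-1)).mean()


if __name__ == "__main__":
    log_probs = torch.randn(4, 20)                  # per-token log-probs of sampled captions
    r_sample = combined_reward(torch.rand(4), torch.rand(4) * 2.0)
    r_greedy = combined_reward(torch.rand(4), torch.rand(4) * 2.0)
    print(self_critical_loss(log_probs, r_sample, r_greedy).item())
```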
Funding: Supported by grant No. 61862050 from the National Natural Science Foundation of China and grant No. 2020AAC03031 from the Natural Science Foundation of Ningxia, China.
Abstract: We propose a collaborative learning method to solve the natural image captioning problem. Numerous existing methods use pretrained image-classification CNNs to obtain feature representations for image caption generation, which ignores the gap in image feature representations between different computer vision tasks. To address this problem, our method utilizes the similarity between the image captioning and pix2pix image inversion tasks to ease the feature representation gap. Specifically, our framework consists of two modules: 1) the pix2pix module (P2PM), which has a shared feature extractor to extract feature representations and a U-Net architecture to encode the image to a latent code and then decode it back to the original image; and 2) the natural language generation module (NLGM), which generates descriptions from the feature representations extracted by P2PM. Consequently, both the feature representations and the generated image captions are improved during the collaborative learning process. The experimental results on the MSCOCO 2017 dataset demonstrate the effectiveness of our approach compared to other methods.
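A toy PyTorch sketch of the collaborative setup follows: a shared encoder feeds both a reconstruction decoder (standing in for P2PM) and a caption head (standing in for NLGM), and the two losses are summed so each task shapes the shared representation. The layer sizes and the per-image word scorer are drastic simplifications for illustration only.

```python
import torch
import torch.nn as nn


class CollaborativeModel(nn.Module):
    """Sketch: one shared convolutional encoder, a reconstruction decoder,
    and a caption head; joint training lets reconstruction gradients shape
    the features used for captioning."""

    def __init__(self, d=64, vocab=10000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, d, 4, 2, 1), nn.ReLU(),
                                     nn.Conv2d(d, d, 4, 2, 1), nn.ReLU())
        self.reconstruct = nn.Sequential(nn.ConvTranspose2d(d, d, 4, 2, 1), nn.ReLU(),
                                         nn.ConvTranspose2d(d, 3, 4, 2, 1))
        self.caption_head = nn.Linear(d, vocab)        # toy per-image word scorer

    def forward(self, images):
        feats = self.encoder(images)                   # (B, d, H/4, W/4)
        recon = self.reconstruct(feats)                # (B, 3, H, W)
        pooled = feats.mean(dim=(2, 3))                # (B, d)
        return recon, self.caption_head(pooled)        # reconstruction, word logits


if __name__ == "__main__":
    model = CollaborativeModel()
    imgs = torch.randn(2, 3, 64, 64)
    recon, word_logits = model(imgs)
    targets = torch.randint(0, 10000, (2,))            # dummy caption-word targets
    loss = (nn.functional.mse_loss(recon, imgs)
            + nn.functional.cross_entropy(word_logits, targets))
    print(recon.shape, word_logits.shape, loss.item())
```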
Funding: Supported in part by the National Natural Science Foundation of China, No. 62101136; the Shanghai Sailing Program, No. 21YF1402800; and the National Institutes of Health, Nos. R01CA237267, R01HL151561, R01EB031102, and R01EB032716.
Abstract: Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) that learn rich vision-language correlations from image-text pairs, like BLIP-2 and GPT-4, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs to image quality assessment (IQA), particularly in medical imaging, remains unexplored. This is valuable for objective performance evaluation and as a potential supplement to, or even replacement of, radiologists' opinions. To this end, this study introduces IQAGPT, an innovative computed tomography (CT) IQA system that integrates an image-quality captioning VLM with ChatGPT to generate quality scores and textual reports. First, a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation. To better leverage the capabilities of LLMs, the annotated quality scores are converted into semantically rich text descriptions using a prompt template. Second, the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate quality descriptions. The captioning model fuses image and text features through cross-modal attention. Third, based on the quality descriptions, users verbally request ChatGPT to rate image-quality scores or produce radiological quality reports. The results demonstrate the feasibility of assessing image quality using LLMs. The proposed IQAGPT outperformed GPT-4 and CLIP-IQA, as well as multitask classification and regression models that rely solely on images.
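As a rough illustration of the first and third steps, the sketch below converts an annotated quality score into a text description via a template and assembles a chat prompt asking an LLM to rate the image from that description. The scoring bands, wording, and prompt text are hypothetical; the actual IQAGPT templates and annotation scale are not reproduced here.

```python
def quality_score_to_description(score: int, findings: str = "") -> str:
    """Toy prompt-template step: turn an annotated CT quality score into a
    semantically rich description for fine-tuning a captioning VLM.
    The 1-5 banding and wording are illustrative assumptions."""
    bands = {
        1: "non-diagnostic image quality with severe noise or artifacts",
        2: "poor image quality; major structures are difficult to assess",
        3: "acceptable image quality with moderate noise",
        4: "good image quality with minor noise and clear anatomy",
        5: "excellent image quality with sharp detail and minimal noise",
    }
    desc = f"This CT slice shows {bands[score]} (quality score {score}/5)."
    return f"{desc} {findings}".strip()


def build_rating_prompt(caption: str) -> str:
    """Assemble a user prompt asking a chat LLM to rate the image based on
    the generated quality caption (hypothetical wording)."""
    return ("You are a radiology image-quality assistant. Based on the "
            "description below, return a quality score from 1 to 5 and a "
            f"one-paragraph quality report.\n\nDescription: {caption}")


if __name__ == "__main__":
    cap = quality_score_to_description(4, "Mild streak artifact near the ribs.")
    print(build_rating_prompt(cap))
```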