Funding: Supported by the National Natural Science Foundation of China (Nos. 62572017, 62441232, 62206007) and the R&D Program of Beijing Municipal Education Commission (KZ202210005008).
Abstract: Knowledge-based Visual Question Answering (VQA) requires integrating visual information with reasoning over external knowledge. Existing approaches typically retrieve information from external corpora and rely on pretrained language models for reasoning. However, their performance is often hindered by the limited capabilities of retrievers and the constrained size of knowledge bases. Moreover, relying on image captions to bridge the gap between the visual and language modalities can omit critical visual details. To address these limitations, we propose the Reflective Chain-of-Thought (ReCoT) method, a simple yet effective framework inspired by metacognition theory. ReCoT activates the reasoning capabilities of Multimodal Large Language Models (MLLMs), providing the visual and knowledge cues required to solve complex visual questions. It simulates a metacognitive reasoning process that encompasses monitoring, reflection, and correction. Specifically, in the initial generation stage, an MLLM produces a preliminary answer that serves as the model's initial cognitive output. During the reflective reasoning stage, this answer is critically examined to generate a reflective rationale that integrates key visual evidence and relevant knowledge. In the final refinement stage, a smaller language model uses this rationale to revise the initial prediction, resulting in a more accurate final answer. By harnessing the strengths of MLLMs in visual and knowledge grounding, ReCoT enables smaller language models to reason effectively without depending on image captions or external knowledge bases. Experimental results show that ReCoT outperforms state-of-the-art methods by 2.26% on OK-VQA and 5.8% on A-OKVQA.
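The three stages described in the abstract (initial generation, reflective reasoning, final refinement) can be sketched as a small pipeline. This is only an illustration of the control flow, not the authors' implementation: the callable interfaces `MLLM` and `SmallLM`, the prompt wording, the function name `recot_answer`, and the stub models are all assumptions introduced here.

```python
from typing import Callable, Dict

# Hypothetical interfaces (not from the paper): any model wrapper with these
# call shapes could be plugged in.
MLLM = Callable[[str, str], str]   # (image reference, prompt) -> generated text
SmallLM = Callable[[str], str]     # text-only prompt -> generated text


def recot_answer(image_ref: str, question: str,
                 mllm: MLLM, small_lm: SmallLM) -> Dict[str, str]:
    """Run the three stages and return every intermediate output."""
    # Stage 1, initial generation: the MLLM gives a preliminary answer.
    initial = mllm(image_ref, f"Question: {question}\nAnswer briefly.")

    # Stage 2, reflective reasoning: the MLLM examines that answer and writes
    # a rationale citing visual evidence and relevant knowledge.
    rationale = mllm(
        image_ref,
        f"Question: {question}\nProposed answer: {initial}\n"
        "Examine this answer critically and cite the visual evidence "
        "and knowledge that support or contradict it.",
    )

    # Stage 3, final refinement: a smaller text-only model revises the answer
    # using the rationale, with no caption and no external knowledge base.
    final = small_lm(
        f"Question: {question}\nInitial answer: {initial}\n"
        f"Rationale: {rationale}\nFinal answer:"
    )
    return {"initial": initial, "rationale": rationale, "final": final}


# Stub models so the sketch runs without any real model (purely illustrative).
def stub_mllm(image_ref: str, prompt: str) -> str:
    if "Proposed answer" in prompt:
        return "The animal has stripes, so it is a zebra rather than a horse."
    return "horse"


def stub_small_lm(prompt: str) -> str:
    return "zebra" if "zebra" in prompt else "unknown"


result = recot_answer("img_001.jpg", "What animal is this?",
                      stub_mllm, stub_small_lm)
print(result["initial"], "->", result["final"])
```

Returning the intermediate outputs alongside the final answer keeps the monitoring, reflection, and correction steps inspectable, which is useful when checking whether the rationale actually changed the prediction.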
Funding: This work is supported by the National Natural Science Foundation of China (61872231, 61701297).
Abstract: Visual Question Answering (VQA) has attracted extensive research attention and has recently become a prominent topic in deep learning. Advances in computer vision and natural language processing have driven progress in this area. The main routes to improving VQA performance lie in the feature extraction, multimodal fusion, and answer prediction modules. One unsolved issue is that the image feature extraction modules commonly used in VQA have difficulty extracting fine-grained features from objects of different scales. In this paper, we design a novel feature extraction network that combines multi-scale convolution and self-attention branches to address this problem. Our approach achieves state-of-the-art single-model performance on the Pascal VOC 2012, VQA 1.0, and VQA 2.0 datasets.
Abstract: Visual question answering (VQA) is a multimodal task that requires a deep understanding of the image scene and of the question's meaning, and must capture the relevant correlations between the two modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the multimodal bilinear pooling fusion technique in the trade-off between model complexity and overall model performance. Some fusion techniques can significantly increase model complexity, which seriously limits their applicability to VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques in terms of their ability to reduce model complexity and improve model performance for this kind of VQA system. Experiments indicate that these multimodal bilinear pooling fusion techniques improve the VQA model's performance, reaching a best performance of 89.25%. Further, experiments show that the number of answers in the developed VQA system is a critical factor that affects how effectively these multimodal bilinear pooling techniques achieve their main objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique shows the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
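To make the complexity trade-off concrete, the sketch below compares parameter counts for full bilinear pooling (one weight matrix per output, so d_x * d_q * d_o weights) and a generic low-rank Hadamard-product factorization, z = P^T((U^T x) * (V^T q)), which needs k * (d_x + d_q + d_o) weights. This is a generic factorization chosen for illustration, not MLPB and not any of the eight techniques the paper compares. The image width 2048 matches the pooled output of ResNet-152; the question width 1024, the rank k = 1000, and the 3000-answer setting are assumptions made here, and bias terms are ignored.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def full_bilinear_params(d_x: int, d_q: int, d_o: int) -> int:
    """Weights of full bilinear pooling: one d_x-by-d_q matrix per output."""
    return d_x * d_q * d_o


def low_rank_params(d_x: int, d_q: int, d_o: int, k: int) -> int:
    """Weights of the low-rank form: U (d_x by k), V (d_q by k), P (k by d_o)."""
    return k * (d_x + d_q + d_o)


def low_rank_bilinear(x: Vector, q: Vector,
                      U: Matrix, V: Matrix, P: Matrix) -> Vector:
    """Compute z = P^T ((U^T x) * (V^T q)) with an element-wise product."""
    k = len(P)
    ux = [sum(U[i][j] * x[i] for i in range(len(x))) for j in range(k)]
    vq = [sum(V[i][j] * q[i] for i in range(len(q))) for j in range(k)]
    joint = [a * b for a, b in zip(ux, vq)]
    return [sum(P[j][o] * joint[j] for j in range(k)) for o in range(len(P[0]))]


# Assumed sizes (hypothetical except the 2048-wide ResNet-152 feature).
D_IMAGE, D_QUESTION, RANK = 2048, 1024, 1000

for n_answers in (2, 3000):  # yes/no versus a large open-ended answer set
    full = full_bilinear_params(D_IMAGE, D_QUESTION, n_answers)
    low = low_rank_params(D_IMAGE, D_QUESTION, n_answers, RANK)
    print(f"answers={n_answers}: full={full}, low-rank={low}, "
          f"ratio={full / low:.1f}")

# Tiny forward pass to show the fusion itself.
z = low_rank_bilinear(
    x=[1.0, 2.0], q=[3.0],
    U=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0]], P=[[1.0], [1.0]],
)
```

Under these assumed sizes the factorization saves only about 1.4x for two answers but roughly 1000x for 3000 answers, which is one way to see why the number of answers could determine how much complexity such techniques actually remove.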
Funding: Supported by the Program for Liaoning Excellent Talents in University (No. LR15045) and the Liaoning Provincial Science and Technology Department Applied Basic Research Plan (No. 101300243).
Abstract: Medical visual question answering (MedVQA) faces unique challenges due to the high precision required for images and the specialized nature of the questions. These challenges include insufficient feature extraction capabilities, a lack of textual priors, and incomplete information fusion and interaction. To address these issues, this paper proposes an enhanced bootstrapping language-image pre-training (BLIP) model for MedVQA based on multimodal feature augmentation and triple-path collaborative attention (FCA-BLIP). First, FCA-BLIP employs a unified bootstrap multimodal architecture that integrates ResNet and Bidirectional Encoder Representations from Transformers (BERT) models to enhance feature extraction, enabling a more precise analysis of the details in images and questions. Next, the pre-trained BLIP model is used to extract features from image-text sample pairs, so that the model can capture the semantic relationships and shared information between images and text. Finally, a novel attention structure is developed to fuse the multimodal feature vectors, thereby improving the alignment accuracy between modalities. Experimental results demonstrate that the proposed method performs well in clinical visual question answering tasks. For the MedVQA task of staging diabetic macular edema in fundus imaging, the proposed method outperforms the existing major models on several performance metrics.
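Abstract: Videos captured in real-world scenes contain various unknown distortion types and lack reference videos, which makes assessing their quality a highly challenging task. In recent years, researchers have incorporated prior knowledge of the human visual system into quality assessment. Building on this, a no-reference video quality assessment method that accounts for background distortion is proposed. While considering video content, the method markedly increases sensitivity to information loss in the video background and gives full attention to extracting background features during the feature extraction stage. Next, a channel mining technique combined with a gating mechanism efficiently integrates high- and low-dimensional features, so that the feature channels focus more precisely on background distortion details. Finally, a temporal modeling module models the features along the time dimension, and linear regression produces an objective quantitative score of video quality. Experiments on the public datasets KoNViD-1k, LIVE-Qualcomm, and CVD2014, using the Spearman rank order correlation coefficient (SROCC), Pearson linear correlation coefficient (PLCC), and root mean squared error (RMSE) as evaluation metrics, show that the method correlates highly with human subjective perception and has a small prediction error. It improves the accuracy and reliability of video quality assessment and more closely simulates human judgments of video quality.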
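The three evaluation metrics named in this abstract have standard definitions, sketched below in plain Python. SROCC is computed as the Pearson correlation of average ranks, which handles ties. The score lists are made-up illustrative values, not results from the paper, and the sketch skips the nonlinear mapping that quality-assessment studies often fit before computing PLCC and RMSE.

```python
import math
from typing import List, Sequence


def plcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def srocc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


# Illustrative values only: model predictions versus subjective scores.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]

print("SROCC:", srocc(predicted, subjective))
print("PLCC:", plcc(predicted, subjective))
print("RMSE:", rmse(predicted, subjective))
```

In this toy case the predictions preserve the ordering of the subjective scores, so SROCC is 1 while PLCC falls below 1, which shows that the two correlations measure monotonic and linear agreement respectively, and RMSE reports the absolute error on the score scale.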
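Abstract: During image and video acquisition and transmission, limitations of the physical environment and of algorithm performance inevitably cause unpredictable quality degradation, which restricts use in practical scenarios and noticeably affects the human visual experience. Image/video quality assessment has therefore emerged as an important task in computer vision. Its goal is to build computational mathematical models that measure the distortion in an image or video in order to judge its quality, thereby predicting quality automatically. It has broad application prospects in scenarios such as urban life, traffic surveillance, and multimedia live streaming. Research on image/video quality assessment has made considerable progress and has benefited other computer vision tasks. Based on an extensive survey of prior work, this paper reviews the development of the field, lists milestone and highly influential algorithms among both traditional and deep learning methods, and then surveys the literature from three perspectives: full-reference, reduced-reference, and no-reference assessment. The methods covered include those based on structural information, on the human visual system, and on natural image statistics. Using public datasets such as LIVE (Laboratory for Image & Video Engineering), CSIQ (Categorical Subjective Image Quality database), and TID2013, and evaluation metrics such as SROCC (Spearman rank order correlation coefficient) and PLCC (Pearson linear correlation coefficient), the performance of representative algorithms is analyzed. Finally, the remaining challenges and open problems in quality assessment are summarized and future directions are discussed. The paper aims to provide researchers in the field with a fairly comprehensive reference.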