Journal Articles
152 articles found
Visual explainable artificial intelligence for graph-based visual question answering and scene graph curation
1
Authors: Sebastian Künzel, Tanja Munz-Körner, Pascal Tilli, Noel Schäfer, Sandeep Vidyapu, Ngoc Thang Vu, Daniel Weiskopf. Visual Computing for Industry, Biomedicine, and Art, 2025, No. 1, pp. 133-161 (29 pages)
This study presents a novel visualization approach to explainable artificial intelligence for graph-based visual question answering (VQA) systems. The method focuses on identifying false answer predictions by the model and offers users the opportunity to directly correct mistakes in the input space, thus facilitating dataset curation. The decision-making process of the model is demonstrated by highlighting certain internal states of a graph neural network (GNN). The proposed system is built on top of a GraphVQA framework that implements various GNN-based models for VQA trained on the GQA dataset. The authors evaluated their tool through the demonstration of identified use cases, quantitative measures, and a user study conducted with experts from the machine learning, visualization, and natural language processing domains. The findings highlight how the implemented features support users in identifying incorrect predictions and their underlying issues. Additionally, the approach is easily extendable to similar models aimed at graph-based question answering.
Keywords: visual question answering, explainable artificial intelligence, visual analytics, scene graphs
Medical visual question answering enhanced by multimodal feature augmentation and tri-path collaborative attention
2
Authors: SUN Haocheng, DUAN Yong. High Technology Letters, 2025, No. 2, pp. 175-183 (9 pages)
Medical visual question answering (MedVQA) faces unique challenges due to the high precision required for images and the specialized nature of the questions. These challenges include insufficient feature extraction capabilities, a lack of textual priors, and incomplete information fusion and interaction. This paper proposes an enhanced bootstrapping language-image pre-training (BLIP) model for MedVQA based on multimodal feature augmentation and tri-path collaborative attention (FCA-BLIP) to address these issues. First, FCA-BLIP employs a unified bootstrap multimodal model architecture that integrates ResNet and bidirectional encoder representations from Transformers (BERT) models to enhance feature extraction capabilities, enabling a more precise analysis of the details in images and questions. Next, the pre-trained BLIP model is used to extract features from image-text sample pairs, so the model can understand the semantic relationships and shared information between images and text. Finally, a novel attention structure is developed to fuse the multimodal feature vectors, thereby improving the alignment accuracy between modalities. Experimental results demonstrate that the proposed method performs well in clinical visual question-answering tasks. For the MedVQA task of staging diabetic macular edema in fundus imaging, the proposed method outperforms existing major models in several performance metrics.
Keywords: multimodal, deep learning, visual question answering (VQA), feature extraction, attention mechanism
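The abstract names a tri-path collaborative attention but not its wiring. Below is a minimal PyTorch sketch of one plausible reading, with an image self-attention path, a question self-attention path, and a cross-modal path fused by a learned gate; the module names, the three-path split, and the gating are assumptions, not the published FCA-BLIP architecture.

```python
import torch
import torch.nn as nn

class TriPathAttention(nn.Module):
    """Hypothetical tri-path fusion: image self-attention, text
    self-attention, and image-text cross-attention, gated together."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(3 * dim, 3)  # softmax weights over the paths

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        v, _ = self.img_self(img, img, img)   # path 1: visual context
        t, _ = self.txt_self(txt, txt, txt)   # path 2: textual context
        c, _ = self.cross(txt, img, img)      # path 3: text queries image
        pooled = [x.mean(dim=1) for x in (v, t, c)]   # (B, dim) each
        w = torch.softmax(self.gate(torch.cat(pooled, dim=-1)), dim=-1)
        fused = sum(w[:, i:i + 1] * p for i, p in enumerate(pooled))
        return fused  # (B, dim) joint representation for the answer head

img = torch.randn(2, 36, 768)   # e.g., 36 ResNet region features
txt = torch.randn(2, 20, 768)   # e.g., 20 BERT token features
print(TriPathAttention()(img, txt).shape)  # torch.Size([2, 768])
```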
A survey of deep learning-based visual question answering (citations: 1)
3
Authors: HUANG Tong-yuan, YANG Yu-ling, YANG Xue-jiao. Journal of Central South University (SCIE, EI, CAS, CSCD), 2021, No. 3, pp. 728-746 (19 pages)
With the rapid rise and continuous development of machine learning, especially deep learning, research in the field of visual question answering (VQA) has made significant progress, with important theoretical significance and practical application value. Therefore, it is necessary to summarize the current research and provide some reference for researchers in this field. This article conducts a detailed and in-depth analysis and summary of relevant research and typical methods in the VQA field. First, relevant background knowledge about VQA is introduced. Second, the issues and challenges of VQA are discussed, along with some promising discussion of particular methodologies. Third, the key sub-problems affecting VQA are summarized and analyzed. Then, the commonly used datasets and evaluation indicators are summarized. Next, in view of the popular algorithms and models in VQA research, a comparison of the algorithms and models is presented. Finally, the future development trend of VQA is discussed and conclusions are drawn.
Keywords: computer vision, natural language processing, visual question answering, deep learning, attention mechanism
Deep Multi-Module Based Language Priors Mitigation Model for Visual Question Answering (citations: 1)
4
Authors: 于守健, 金学勤, 吴国文, 石秀金, 张红. Journal of Donghua University (English Edition) (CAS), 2023, No. 6, pp. 684-694 (11 pages)
The original intention of visual question answering (VQA) models is to infer the answer from the information in the visual image that is relevant to the question text, but many VQA models yield answers that are biased by prior knowledge, especially language priors. This paper proposes a mitigation model called language priors mitigation-VQA (LPM-VQA) for the language priors problem in VQA models, which divides language priors into positive and negative ones. Different network branches are used to capture and process the different priors to mitigate their effect. A dynamically changing language prior feedback objective function is designed using the intermediate results of some modules in the VQA model. The weight of the loss value for each answer is dynamically set according to the strength of its language priors to balance its proportion in the total VQA loss and further mitigate the language priors. The model does not depend on a particular baseline VQA architecture and can be configured like a plug-in to improve the performance of most existing VQA models. Experimental results show that the proposed model is general and effective, achieving state-of-the-art accuracy on the VQA-CP v2 dataset.
Keywords: visual question answering (VQA), language priors, natural language processing, multimodal fusion, computer vision
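LPM-VQA's central device is a loss whose per-answer weight shrinks as that answer's language prior grows. A minimal sketch of the idea, assuming the prior strength is read off a question-only branch and the weight is simply 1 - p_prior (the paper's actual weighting schedule may differ):

```python
import torch
import torch.nn.functional as F

def prior_weighted_vqa_loss(fused_logits: torch.Tensor,
                            question_only_logits: torch.Tensor,
                            target: torch.Tensor) -> torch.Tensor:
    """Down-weight answers the question-only branch already predicts,
    so the model cannot score well from language priors alone.
    fused_logits / question_only_logits: (B, num_answers); target: (B,).
    Hypothetical weighting: w = 1 - p_prior(target)."""
    p_prior = torch.softmax(question_only_logits, dim=-1).detach()
    w = 1.0 - p_prior.gather(1, target.unsqueeze(1)).squeeze(1)
    ce = F.cross_entropy(fused_logits, target, reduction="none")
    return (w * ce).mean()

fused = torch.randn(4, 3000)            # full-model answer logits
prior = torch.randn(4, 3000)            # question-only branch logits
target = torch.randint(0, 3000, (4,))   # ground-truth answer ids
print(prior_weighted_vqa_loss(fused, prior, target))
```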
Dual modality prompt learning for visual question-grounded answering in robotic surgery (citations: 2)
5
Authors: Yue Zhang, Wanshu Fan, Peixi Peng, Xin Yang, Dongsheng Zhou, Xiaopeng Wei. Visual Computing for Industry, Biomedicine, and Art, 2024, No. 1, pp. 316-328 (13 pages)
With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding the textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
Keywords: prompt learning, visual prompt, textual prompt, grounding-answering, visual question answering
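As a rough illustration of where dual-modality prompts enter such a model, the sketch below prepends learnable visual and textual prompt tokens to each stream before a shared encoder. The real complementary prompters also mix cross-modal knowledge into the prompts; everything here is an assumed simplification.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Sketch of dual-modality prompting: learnable visual and textual
    prompt tokens are prepended to each stream before a shared encoder."""
    def __init__(self, dim: int = 512, n_prompts: int = 8):
        super().__init__()
        self.visual_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        b = vis.size(0)
        vp = self.visual_prompts.expand(b, -1, -1)  # broadcast to batch
        tp = self.text_prompts.expand(b, -1, -1)
        seq = torch.cat([vp, vis, tp, txt], dim=1)  # joint prompted sequence
        return self.encoder(seq)

out = PromptedEncoder()(torch.randn(2, 36, 512), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 68, 512]): 8 + 36 + 8 + 16 tokens
```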
Improved Blending Attention Mechanism in Visual Question Answering
6
Authors: Siyu Lu, Yueming Ding, Zhengtong Yin, Mingzhe Liu, Xuan Liu, Wenfeng Zheng, Lirong Yin. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 10, pp. 1149-1161 (13 pages)
Visual question answering (VQA) has attracted more and more attention in computer vision and natural language processing. Scholars are committed to studying how to better integrate image features and text features to achieve better results in VQA tasks. Analyzing all features may cause information redundancy and a heavy computational burden, and an attention mechanism is a wise way to solve this problem. However, using a single attention mechanism may lead to incomplete coverage of the features. This paper improves on such methods and proposes a hybrid attention mechanism that combines spatial attention and channel attention. Because an attention mechanism can cause a loss of the original features, a small portion of the image features is added back as compensation. For the attention over text features, a self-attention mechanism is introduced, and the internal structural features of sentences are strengthened to improve the overall model. The results show that the attention mechanism and feature compensation add 6.1% accuracy to the multimodal low-rank bilinear pooling network.
Keywords: visual question answering, spatial attention mechanism, channel attention mechanism, image feature processing, text feature extraction
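The mechanism described above (channel plus spatial attention with a small residual compensation of the raw features) can be sketched directly; the squeeze-excite-style channel branch, the 7x7 spatial convolution, and the mixing coefficient alpha are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Sketch: channel attention then spatial attention over an image
    feature map, plus a small residual 'compensation' of the untouched
    features, as the abstract describes. alpha is an assumed value."""
    def __init__(self, channels: int = 256, alpha: float = 0.1):
        super().__init__()
        self.alpha = alpha
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels // 8, 1),
            nn.ReLU(), nn.Conv2d(channels // 8, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attended = x * self.channel(x)                 # reweight channels
        attended = attended * self.spatial(attended)   # reweight locations
        return attended + self.alpha * x               # compensate lost info

feat = torch.randn(2, 256, 14, 14)
print(HybridAttention()(feat).shape)  # torch.Size([2, 256, 14, 14])
```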
MKEAH:Multimodal knowledge extraction and accumulation based on hyperplane embedding for knowledge-based visual question answering
7
Authors: Heng ZHANG, Zhihua WEI, Guanming LIU, Rui WANG, Ruibin MU, Chuanbao LIU, Aiquan YUAN, Guodong CAO, Ning HU. Virtual Reality & Intelligent Hardware (EI), 2024, No. 4, pp. 280-291 (12 pages)
Background: External knowledge representations play an essential role in knowledge-based visual question answering, helping models better understand complex scenarios in the open world. Recent entity-relationship embedding approaches are deficient in representing some complex relations, resulting in a lack of topic-related knowledge and redundancy in topic-irrelevant information. Methods: To this end, we propose MKEAH: Multimodal Knowledge Extraction and Accumulation on Hyperplanes. To ensure that the lengths of the feature vectors projected onto the hyperplane compare equally and to filter out sufficient topic-irrelevant information, two losses are proposed to learn the triplet representations from complementary views: a range loss and an orthogonal loss. To quantify the capability of extracting topic-related knowledge, we present the Topic Similarity (TS) between topics and entity relations. Results: Experimental results demonstrate the effectiveness of hyperplane embedding for knowledge representation in knowledge-based visual question answering. Our model outperformed state-of-the-art methods by 2.12% and 3.24% on two challenging knowledge-request datasets, OK-VQA and KRVQA, respectively. Conclusions: The clear advantages of our model in TS show that using hyperplane embedding to represent multimodal knowledge can improve its ability to extract topic-related knowledge.
Keywords: knowledge-based visual question answering, hyperplane, topic-related
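For the hyperplane embedding, a TransH-style projection with assumed forms of the two proposed losses can be sketched as follows; the exact range and orthogonal loss definitions in MKEAH may differ.

```python
import torch
import torch.nn.functional as F

def project(e: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Project embeddings e onto the hyperplane with unit normal w
    (TransH-style): e_perp = e - (w . e) w."""
    w = F.normalize(w, dim=-1)
    return e - (e * w).sum(-1, keepdim=True) * w

def range_loss(h: torch.Tensor, t: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Assumed form: push projected head/tail norms toward a common length
    # so vectors on the hyperplane "compare equally", per the abstract.
    hn, tn = project(h, w).norm(dim=-1), project(t, w).norm(dim=-1)
    return ((hn - 1.0) ** 2 + (tn - 1.0) ** 2).mean()

def orthogonal_loss(d: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Assumed form: keep the relation translation d inside the hyperplane,
    # i.e., orthogonal to the normal w.
    w = F.normalize(w, dim=-1)
    return ((d * w).sum(-1) ** 2 / (d.norm(dim=-1) ** 2 + 1e-8)).mean()

h, t, d, w = (torch.randn(32, 200) for _ in range(4))
print(range_loss(h, t, w), orthogonal_loss(d, w))
```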
Bootstrapping Large Language Models with Outside Knowledge for Knowledge-based Visual Question Answering
8
Authors: Yanze Min, Yawei Sun, Yin Zhu, Jun Zhu, Bo Zhang. Machine Intelligence Research, 2026, No. 1, pp. 115-132 (18 pages)
Knowledge-based visual question answering (KB-VQA), which requires external world knowledge beyond the image for reasoning, is more challenging than traditional visual question answering. Recent works have demonstrated the effectiveness of using a large (vision) language model as an implicit knowledge source to acquire the necessary information. However, the knowledge stored in large models (LMs) is often coarse-grained and inaccurate, causing questions requiring finer-grained information to be answered incorrectly. In this work, we propose a variational expectation-maximization (EM) framework that bootstraps the VQA performance of LMs with their own answers. In contrast to former VQA pipelines, we treat the outside knowledge as a latent variable. In the E-step, we approximate the posterior with two components: first, a rough answer, e.g., a general description of the image, which is usually the strength of LMs, and second, a multimodal neural retriever that retrieves question-specific knowledge from an external knowledge base. In the M-step, the training objective optimizes the ability of the original LMs to generate rough answers as well as refined answers based on the retrieved information. Extensive experiments show that our proposed framework, BootLM, has strong retrieval ability and achieves state-of-the-art performance on knowledge-based VQA tasks.
Keywords: multi-modal large language models, visual question answering (VQA), knowledge retrieval, graphical models, machine learning
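BootLM's abstract describes a variational EM scheme with the outside knowledge as a latent variable. A standard evidence lower bound consistent with that description (the exact factorization BootLM optimizes is not spelled out in the abstract, so this is an assumed form) is:

```latex
\log p(a \mid q, v)
  \;\ge\;
  \mathbb{E}_{q(z \mid q, v)}\big[\log p(a \mid z, q, v)\big]
  \;-\;
  \mathrm{KL}\big(q(z \mid q, v) \,\|\, p(z \mid q, v)\big)
```

Here a is the answer, q the question, v the image, and z the latent knowledge; the E-step fits q(z | q, v) (the rough answer plus retrieved passages), and the M-step maximizes the expected log-likelihood term over the LM's parameters.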
Answer Semantics-enhanced Medical Visual Question Answering
9
Authors: Yuliang Liang, Enneng Yang, Guibing Guo, Wei Cai, Linying Jiang, Jianzhe Zhao, Xingwei Wang. Machine Intelligence Research, 2025, No. 6, pp. 1127-1137 (11 pages)
Medical visual question answering (Med-VQA) is a task that aims to answer clinical questions given a medical image. The existing literature generally treats it as a classic classification task based on interaction features of the image and question. However, such a paradigm ignores the valuable semantics of candidate answers as well as their relations. From a real-world dataset, we observe that: 1) the text of candidate answers has a strong intrinsic correlation with medical images; 2) subtle differences among multiple candidate answers are crucial for identifying the correct one. Therefore, we propose an answer semantics enhanced (ASE) method to integrate the semantics of answers and capture their subtle differences. Specifically, we enhance the semantic correlation of image-question-answer triplets by aligning images and question-answer tuples within the feature fusion module. Then, we devise a contrastive learning loss to highlight the semantic differences between the correct answer and the other answers. Finally, extensive experiments demonstrate the effectiveness of our method.
Keywords: medical visual question answering (Med-VQA), semantic knowledge, vision-language model, contrastive learning, question answering
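The contrastive part of ASE can be illustrated with an InfoNCE-style loss that pulls the fused image-question representation toward the correct answer's text embedding and away from the other candidates; the temperature and the cosine similarity below are assumptions.

```python
import torch
import torch.nn.functional as F

def answer_contrastive_loss(fused: torch.Tensor,
                            answer_emb: torch.Tensor,
                            correct: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style sketch of ASE's idea: contrast the correct answer's
    embedding against all other candidates. fused: (B, D);
    answer_emb: (A, D) for A candidate answers; correct: (B,) indices."""
    sim = F.normalize(fused, dim=-1) @ F.normalize(answer_emb, dim=-1).T
    return F.cross_entropy(sim / tau, correct)

fused = torch.randn(8, 256)        # fused image-question features
answers = torch.randn(100, 256)    # embeddings of 100 candidate answers
correct = torch.randint(0, 100, (8,))
print(answer_contrastive_loss(fused, answers, correct))
```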
Seeing and Reasoning:A Simple Deep Learning Approach to Visual Question Answering
10
Authors: Rufai Yusuf Zakari, Jim Wilson Owusu, Ke Qin, Tao He, Guangchun Luo. Big Data Mining and Analytics, 2025, No. 2, pp. 458-478 (21 pages)
Visual question answering (VQA) is a complex task that requires a deep understanding of both visual content and natural language questions. The challenge lies in enabling models to recognize and interpret visual elements and to reason through questions in a multi-step, compositional manner. We propose a novel Transformer-based model that introduces specialized tokenization techniques to effectively capture intricate relationships between visual and textual features. The model employs an enhanced self-attention mechanism, enabling it to attend to multiple modalities simultaneously, while a co-attention unit dynamically guides focus to the most relevant image regions and question components. Additionally, a multi-step reasoning module supports iterative inference, allowing the model to excel at complex reasoning tasks. Extensive experiments on benchmark datasets demonstrate the model's superior performance, with accuracies of 98.6% on CLEVR, 63.78% on GQA, and 68.67% on VQA v2.0. Ablation studies confirm the critical contribution of key components, such as the reasoning module and the co-attention mechanism, to the model's effectiveness. Qualitative analysis of the learned attention distributions further illustrates the model's dynamic reasoning process, which adapts to task complexity. Overall, our study advances the adaptation of Transformer architectures for VQA, enhancing both reasoning capabilities and model interpretability in visual reasoning tasks.
Keywords: machine learning, deep learning, visual question answering (VQA), multi-step reasoning, computer vision
Zero-Shot Knowledge-Based Visual Question Answering with Frozen Language Models
11
Authors: Jing Liu, Lizong Zhang, Chenpeng Cao, Yinong Shi, Chong Mu, Jiaxin Li. Big Data Mining and Analytics, 2025, No. 6, pp. 1418-1431 (14 pages)
Knowledge-based visual question answering (VQA) is a challenging task that requires models to access external knowledge for reasoning. Large language models (LLMs) have recently been employed for zero-shot knowledge-based VQA due to their inherent knowledge storage and in-context learning capabilities. However, LLMs are commonly perceived as implicit knowledge bases, and their generative and in-context learning potential remains underutilized. Existing works demonstrate that the performance of in-context learning strongly depends on the quality and order of the demonstrations in prompts. In light of this, we propose Knowledge Generation with Frozen Language Models (KGFLM), a novel method for generating explicit knowledge statements to improve zero-shot knowledge-based VQA. Our knowledge generation strategy aims to identify effective demonstrations and determine their optimal order, thereby activating the frozen LLM to produce more useful knowledge statements for better predictions. The generated knowledge statements can also serve as interpretable rationales. In our method, the selection and arrangement of demonstrations are based on the semantic similarity and quality of the demonstrations for each question, without requiring additional annotations. Furthermore, a series of experiments is conducted on the A-OKVQA and OKVQA datasets. The results show that our method outperforms several strong zero-shot knowledge-based VQA methods.
Keywords: knowledge-based visual question answering (VQA), zero-shot learning, large language models (LLMs)
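A minimal sketch of KGFLM's demonstration selection and ordering, assuming cosine similarity over precomputed sentence embeddings and a "most similar last" ordering (the paper also scores demonstration quality, which is omitted here):

```python
import numpy as np

def select_demonstrations(question_vec: np.ndarray,
                          pool_vecs: np.ndarray,
                          pool_texts: list[str],
                          k: int = 4) -> list[str]:
    """Pick the k pool examples most similar to the question and order
    them least-similar first, so the strongest demonstration sits
    closest to the question in the prompt. The ordering heuristic is
    an assumption, not necessarily KGFLM's."""
    q = question_vec / np.linalg.norm(question_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                          # cosine similarity to question
    top = np.argsort(sims)[-k:]           # top-k, ascending similarity
    return [pool_texts[i] for i in top]   # most similar ends the prompt

pool = [f"demo {i}" for i in range(10)]
demos = select_demonstrations(np.random.rand(64), np.random.rand(10, 64), pool)
print(demos)
```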
Dual-Modality Integration Attention with Graph-Based Feature Extraction for Visual Question and Answering
12
Authors: Jing Lu, Chunlei Wu, Leiquan Wang, Ran Li, Xiuxuan Shen. Tsinghua Science and Technology, 2025, No. 5, pp. 2133-2145 (13 pages)
Visual Question and Answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on convolutional neural networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations through two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
Keywords: visual question and answering (VQA), Graph-based Feature Extraction (GFE), Dual-modality Integration Attention (DIA), composite loss
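As a stand-in for the GFE module, the sketch below runs one round of message passing over detected image regions with dense learned edge weights; DIAGFE's actual graph construction and layer stacking are not reproduced.

```python
import torch
import torch.nn as nn

class RegionGraphLayer(nn.Module):
    """One round of message passing over detected image regions, a
    minimal stand-in for graph-based visual feature extraction."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        b, n, d = regions.shape
        src = regions.unsqueeze(2).expand(b, n, n, d)
        dst = regions.unsqueeze(1).expand(b, n, n, d)
        adj = torch.softmax(                       # (B, N, N) edge weights
            self.score(torch.cat([src, dst], -1)).squeeze(-1), dim=-1)
        return regions + torch.relu(adj @ self.msg(regions))  # aggregate

regions = torch.randn(2, 36, 512)   # e.g., 36 detected region features
print(RegionGraphLayer()(regions).shape)  # torch.Size([2, 36, 512])
```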
Research on an image-answer consistency verification method for medical visual question answering
13
Authors: 从浩, 刘利军, 杨小兵. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) (PKU Core), 2026, No. 1, pp. 118-127 (10 pages)
To address the problems of insufficient multimodal feature fusion and image-answer mismatch in medical visual question answering (Med-VQA), which lower model accuracy, an image and answer consistency verification (IACV) model is constructed. In the pre-training stage, multiple pre-training tasks are combined to strengthen the model's multimodal feature extraction and fusion abilities. In the fine-tuning stage, body-part information is used to partition the images, generate an answer mask matrix, and verify the consistency of the final answer, thereby improving model accuracy. Experimental results show that the IACV model reaches accuracies of 78.9% and 84.6% on the public VQA-RAD and SLAKE datasets, respectively, significantly improving the accuracy of the Med-VQA task and providing more reliable support for subsequent applications.
Keywords: medical visual question answering (Med-VQA), answer mask matrix, consistency verification, pre-training
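The answer mask matrix can be pictured as a hard mask over the answer logits, keeping only answers consistent with the body part detected in the image; the mask construction below is hypothetical.

```python
import torch

def masked_answer_logits(logits: torch.Tensor,
                         answer_mask: torch.Tensor) -> torch.Tensor:
    """Sketch of the answer-mask idea: given the body part detected in
    the image, suppress logits of answers that cannot apply to that
    part, so the final prediction stays image-consistent.
    logits: (B, A); answer_mask: (B, A) with 1 = plausible answer."""
    return logits.masked_fill(answer_mask == 0, float("-inf"))

logits = torch.randn(2, 6)
mask = torch.tensor([[1, 1, 0, 0, 1, 0],    # hypothetical: chest answers
                     [0, 1, 1, 1, 0, 0]])   # hypothetical: abdomen answers
print(masked_answer_logits(logits, mask).softmax(-1))
```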
Research on multimodal visual question answering based on fine-grained feature enhancement
14
Authors: 王志伟, 陆振宇. Journal of Nanjing University of Information Science & Technology (PKU Core), 2026, No. 1, pp. 35-47 (13 pages)
Existing multimodal visual question answering (VQA) models ignore the fine-grained interaction between locally salient information in the image and basic local words in the text, so the semantic correlation between image and text needs improvement. This paper therefore proposes a multimodal VQA method based on fine-grained feature enhancement. First, a fine-grained feature extraction method is added for vision and text separately, so as to extract the semantic features of images and questions more comprehensively and accurately. Then, to exploit the alignment information between modalities at different levels, an alignment-guided self-attention module is proposed to align the correspondence between fine-grained features and global semantic features within a single modality (visual or textual), and to fuse single-modality information across levels in a unified way. Finally, experiments on the VQA v2.0 and VQA-CP v2 datasets show that the proposed method outperforms existing models on all VQA evaluation metrics.
Keywords: visual question answering, multimodal, fine-grained, feature enhancement, entity alignment, feature fusion
A cross-modal knowledge introduction and prompt reasoning framework for remote sensing visual question answering
15
Authors: 董欣, 俞鹏飞, 顾晶晶. Journal of Frontiers of Computer Science and Technology (PKU Core), 2026, No. 3, pp. 760-772 (13 pages)
With the rapid development of remote sensing technology, remote sensing visual question answering (RSVQA), an emerging technique combining language and visual interaction, has markedly improved the efficiency of interpreting remote sensing imagery in Earth observation, environmental monitoring, and related fields. However, RSVQA still faces challenges: the high complexity of remote sensing image information, the scarcity of aligned remote sensing image-text data, and the diversity of question phrasing. To address these challenges, a cross-modal knowledge introduction and prompt reasoning framework (CMKIP) for RSVQA is proposed. For the high complexity of remote sensing images, CMKIP builds a learnable image-feature adapter for the large language model LLaMA, giving it the capacity to represent complex images. For the scarcity of aligned image-text data, an automated data generation pipeline is built to produce high-quality image-text pairs from public remote sensing datasets, enabling efficient injection of remote sensing domain knowledge. For the diversity of question phrasing, a novel collaborative reasoning mechanism between large and small models is proposed, in which the small model performs knowledge-base retrieval and intermediate reasoning correction, significantly improving the large language model's understanding of diverse questions and its reasoning accuracy. In addition, CMKIP supports flexibly replacing the small model according to task requirements and can be widely applied to downstream tasks in the remote sensing domain. Experimental results show that CMKIP significantly outperforms existing methods on RSVQA benchmark datasets, performing especially well in low-sample scenarios, demonstrating its effectiveness and generalizability for RSVQA tasks.
Keywords: remote sensing visual question answering, large language models, cross-modal extension, remote sensing fine-tuning instruction set, lightweight models, prompt reasoning
Research on a chain-of-thought technique for situation cognition based on modular reasoning
16
Authors: 姬鸿远, 卿杜政. Journal of System Simulation (PKU Core), 2026, No. 2, pp. 278-293 (16 pages)
To address the limited intelligence of situation understanding in traditional simulation systems, a situation visual question answering dataset is constructed and a modular reasoning framework is proposed. A situation cognition chain of thought, SACoT, is built: under zero-shot conditions, expert prompts guide the model to decompose tasks and fuse multimodal information, generating reasoning chains that strengthen semantic cognition and interpretability, and providing a scalable, low-compute solution. Experimental results show that SACoT optimizes task allocation, focuses the model on the image details relevant to the question, reduces the chain-of-thought fragmentation caused by multi-step reasoning, and alleviates the model's forgetting during long-text reasoning. The results verify the feasibility of modular analysis for situation reasoning and offer new ideas and technical paths for applying AI to battlefield situation simulation fusion and combat decision support.
Keywords: foundation models, chain of thought, situation reasoning, visual question answering, zero-shot learning
A controllable multimodal decision-making method based on chain-of-thought reasoning
17
Authors: 胡宇航, 王诗涵, 刘利龙, 杨振宇, 钱胜胜. Chinese Journal of Computers (PKU Core), 2026, No. 4, pp. 841-854 (14 pages)
In recent years, multimodal large language models (MLLMs) have made remarkable progress in artificial intelligence, particularly in logical reasoning. The emergence of chain-of-thought reasoning has greatly enhanced the capabilities of large language models, especially on complex reasoning tasks. Despite this progress, MLLMs still face challenges in controllable reasoning, particularly in visual question answering settings: traditional approaches often lead to unstructured reasoning processes or are constrained by rigid frameworks, limiting their adaptability and effectiveness. To address these issues, this work proposes CMDM-CoT (Controlled Multimodal Decision-making Method based on Chain-of-Thought reasoning), a novel controllable framework designed to strengthen the reasoning ability of MLLMs. CMDM-CoT introduces an adaptive problem-solving decision set, allowing the model to autonomously select an appropriate reasoning path according to task complexity and thereby overcome the limitations of fixed frameworks. In addition, CMDM-CoT includes a state evaluation mechanism that scores each reasoning state to ensure logical consistency and high-quality learning. This approach not only encourages minimal reasoning for simple tasks but also supports detailed reasoning for complex problems. Notably, CMDM-CoT performs well when applied to three mainstream models, Llama, Qwen2-VL, and InternVL2, improving them by 7.3% on average over the baselines. These models even surpass the larger closed-source model GPT-4V, showing that the open-source models in this study are competitive on multiple benchmarks.
Keywords: chain-of-thought reasoning, multimodal large language models, visual question answering, adaptive decision mechanism, complex reasoning
Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering (citations: 3)
18
Authors: Zhongjian Hu, Peng Yang, Fengyuan Liu, Yuan Meng, Xingyu Liu. Big Data Mining and Analytics (EI, CSCD), 2024, No. 3, pp. 843-857 (15 pages)
Previous works employ a large language model (LLM) such as GPT-3 for knowledge-based visual question answering (VQA). We argue that the inferential capacity of the LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they may have some limitations, such as the possibility of failing to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled "Prompting Large Language Models with Knowledge-Injection" (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt the LLM itself for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach improves accuracy by over 1.3 and 1.7 points on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.
Keywords: visual question answering, knowledge-based visual question answering, large language model, knowledge injection
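A sketch of what a knowledge-injected prompt in the spirit of PLLMKI could look like, with the vanilla VQA model's candidate answers and an LLM-generated knowledge passage placed before the question; the template wording and fields are assumptions, not the paper's.

```python
def build_prompt(caption: str, candidates: list[tuple[str, float]],
                 knowledge: str, question: str) -> str:
    """Assemble a knowledge-injected prompt: the vanilla VQA model's
    top answers 'inspire' the LLM, and a generated knowledge passage
    is injected before the question."""
    cand = ", ".join(f"{a} ({p:.2f})" for a, p in candidates)
    return (f"Context: {caption}\n"
            f"Candidate answers from a VQA model: {cand}\n"
            f"Background knowledge: {knowledge}\n"
            f"Question: {question}\nAnswer:")

print(build_prompt("a man holding a racket on a court",
                   [("tennis", 0.62), ("badminton", 0.21)],
                   "Tennis is played on a court with rackets and a ball.",
                   "What sport is this?"))
```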
Learning a Mixture of Conditional Gating Blocks for Visual Question Answering
19
Authors: Qiang Sun, Yan-Wei Fu, Xiang-Yang Xue. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2024, No. 4, pp. 912-928 (17 pages)
As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question given an image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways to improve the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it is relatively less explored, and very nontrivial, to exploit dynamics in the transformers of VQA tasks through all stages in an end-to-end manner. Typically, due to the large computational cost of transformers, researchers are inclined to apply transformers only to the extracted high-level visual features for downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer to the transformer, as it can effectively increase the model capacity and requires fewer transformer layers for the VQA task. In particular, we name the dynamics in the transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel model, a mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional-gating CNN model and the conditional-gating transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
Keywords: visual question answering, transformer, dynamic network
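One way to read "question-guided dynamics" is per-head gating of self-attention conditioned on the question vector, sketched below; the published cMHSA may instead condition the projection weights themselves.

```python
import torch
import torch.nn as nn

class ConditionalMHSA(nn.Module):
    """Sketch of a question-guided dynamic layer: a question vector
    produces per-head gates that rescale multi-head self-attention
    outputs."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, heads), nn.Sigmoid())

    def forward(self, x: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)                  # (B, N, dim)
        g = self.gate(question)                      # (B, heads) in [0, 1]
        b, n, _ = out.shape
        out = out.reshape(b, n, self.heads, self.head_dim)
        out = out * g.unsqueeze(1).unsqueeze(-1)     # gate each head
        return out.reshape(b, n, -1)

x, q = torch.randn(2, 49, 512), torch.randn(2, 512)
print(ConditionalMHSA()(x, q).shape)  # torch.Size([2, 49, 512])
```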
A visual storytelling model based on planning learning
20
Authors: 王元龙, 张宁倩, 张虎. Computer Science (PKU Core), 2025, No. 9, pp. 269-275 (7 pages)
In recent years, visual storytelling has drawn increasing attention from researchers in computer vision and natural language processing. Most existing models focus on enhancing image representations, for example by introducing external knowledge or scene graphs. Although some progress has been made, the generated stories still suffer from repeated content and sparse detail. To address these problems, a visual storytelling model based on planning learning is proposed. It introduces a planning-learning method that poses questions along six dimensions: topic, object, action, location, reasoning, and prediction, and uses a visual question answering pre-trained language model to generate the answers, completing the plan that guides story generation. The model works in four stages: the first stage extracts visual information from the images; the second stage extracts and selects relevant concepts with a concept generator; the third stage uses a pre-trained language model to guide the generation of planning information; and the fourth stage fuses the visual, conceptual, and planning information from the first three stages to complete the visual storytelling task. The model is validated on the public VIST dataset; compared with the existing COVS model, it improves BLEU-1, BLEU-2, ROUGE_L, Distinct-3, Distinct-4, and TTR by 1.58, 2.7, 0.4, 2.2, 3.6, and 5.6 percentage points, respectively.
Keywords: visual storytelling, planning learning, visual question answering