
Probe-Based Multi-Modal Explanation Method for Visual Question Answering (基于探针引导的视觉语言多模态解释方法)
Abstract (translated from the Chinese): Most existing vision-language models perform cross-modal reasoning through deep neural networks with a "black-box" structure, whose internal execution is difficult for humans to understand intuitively. This paper therefore focuses on Natural Language Explanation (NLE) methods for the Visual Question Answering (VQA) task, aiming to explain the model's reasoning process through generated natural-language statements. Although existing methods have made progress, they still face the following challenges: (1) the answer-prediction process and the explanation-generation process interfere with each other, weakening the faithfulness of explanations; (2) existing methods can generate only single-modality explanations, which suffer from semantic ambiguity caused by vague references. To this end, we propose a new multi-modal explanation method for the VQA reasoning process (Probe-based Multi-modal Explanation method, PME), which extracts information from every hidden-layer state of the reasoning process without affecting the original reasoning path, ensuring that the explanation process stays faithful to the original reasoning. In addition, we use a pseudo-labeling method to fuse the VQA-X and GQA datasets, achieving multi-modal explanation while preserving faithfulness and alleviating the referential ambiguity of single-modality text explanations. We compare PME against other state-of-the-art models on the VQA-X and A-OKVQA visual question answering datasets; experimental results show that PME achieves higher explanation evaluation scores on the corresponding test sets. We hope this work provides a new research foundation for understanding the internals of network models. Code is available at: https://github.com/LouisJacky/LAVIS_PME.

Abstract (English): Recently, vision-language models have achieved remarkable performance on various complex tasks such as Visual Question Answering (VQA), Image Captioning, and Referring Expression Comprehension (REC). However, these models mostly adopt a "black-box" structure built on deep neural networks, making their inference processes difficult to understand intuitively. To explore the internal mechanisms of neural-network-based models, researchers have proposed various interpretation methods, including gradient activation, saliency maps, and natural language explanations. Among these approaches, Natural Language Explanation (NLE) has gained significant attention in the vision-language community, particularly for VQA tasks (VQA-NLE), as it provides human-interpretable explanations that help understand model reasoning and develop more trustworthy deep learning systems. Existing VQA-NLE approaches can be categorized into two paradigms: post-hoc explanation methods and self-rationalization methods. Post-hoc methods typically employ separate decision and explanation models: the decision model first predicts an answer, and the explanation model then uses the predicted answer along with the question to infer the reasoning process. However, this sequential, separate processing leaves no direct logical connection between the explanation and reasoning processes, compromising explanation reliability. In contrast, self-rationalization methods unify answer prediction and explanation generation into a single task. Despite offering simpler implementation and more reliable modeling of the logical relationship, such methods inevitably allow mutual interference between the answer-generation and explanation processes, weakening explanation faithfulness. Through extensive experiments on the VQA-X dataset, we validate this interference: 8.38% of test samples show different answer predictions when an explanation loss is introduced during joint training, thus weakening the faithfulness of explanations. To address these challenges, we propose a novel Probe-based Multi-modal Explanation method (PME). Our method introduces a learnable probe structure that can extract information from each encoder hidden-layer state during forward propagation without interfering with the original reasoning process. Experimental results confirm that our probe-based approach maintains identical answer predictions to the original model while generating faithful explanations. Additionally, to address the semantic ambiguity inherent in single-modal text explanations, we develop a cross-generation pseudo-labeling approach that enables simultaneous generation of natural language explanations and object detection boxes. This multi-modal approach significantly improves explanation clarity by providing explicit visual grounding for textual references, improving clarity scores from 67.5% to 79.2% on VQA-X and from 58.9% to 66.2% on A-OKVQA. Extensive experiments demonstrate that our PME method outperforms state-of-the-art models on both the VQA-X and A-OKVQA benchmarks, achieving 5.5% and 2.1% improvements in CIDEr scores, respectively, over the previous best method, S3C. Both quantitative and qualitative analyses validate that our method more accurately reflects the model's reasoning process while maintaining explanation faithfulness. As a model-agnostic strategy, our approach can be readily applied to other vision-language models, providing a new paradigm for developing more reliable and comprehensive model explanation methods. Code is available at: https://github.com/LouisJacky/LAVIS_PME.
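The key property claimed for the probe is that it reads every hidden-layer state during the forward pass while leaving the original reasoning path untouched, so answer predictions are identical with or without it. The minimal sketch below illustrates that read-only tap pattern on a toy stand-in model; it is not the authors' implementation, and all class and method names here (`TinyEncoder`, `register_tap`, `Probe`) are hypothetical. In PME the probe attaches to the encoder layers of a vision-language transformer.

```python
# Toy illustration of a read-only probe: it observes each hidden state
# during forward propagation but never rewrites the computation, so the
# host model's output is unchanged. Names are illustrative only.

class TinyEncoder:
    """A stand-in 'black-box' model with three layers."""
    def __init__(self):
        self.taps = []              # registered probe callbacks

    def register_tap(self, fn):
        self.taps.append(fn)

    def forward(self, x):
        h = x
        for layer in range(3):
            h = 2 * h + 1           # the original reasoning path
            for fn in self.taps:
                fn(layer, h)        # probes only *read* h
        return h                    # answer is unaffected by taps

class Probe:
    """Collects every hidden state for downstream explanation generation."""
    def __init__(self):
        self.states = []

    def __call__(self, layer, h):
        self.states.append((layer, h))   # read-only record

model = TinyEncoder()
baseline = model.forward(5)         # answer without any probe

probe = Probe()
model.register_tap(probe)
probed = model.forward(5)           # answer with the probe attached

assert probed == baseline           # reasoning path is untouched
print(probe.states)                 # [(0, 11), (1, 23), (2, 47)]
```

In a real vision-language model the same effect is typically obtained by detaching the captured hidden states from the computation graph, so that gradients from the explanation loss cannot flow back into, and perturb, the answer-prediction pathway.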
Authors: SUO Wei (索伟); LV Jia-Qi (吕家齐); SUN Meng-Yang (孙梦阳); LIU Le (刘乐); WANG Peng (王鹏) (School of Computer Science, Northwestern Polytechnical University, Xi'an 710129; School of Cybersecurity, Northwestern Polytechnical University, Xi'an 710129)
Source: Chinese Journal of Computers (《计算机学报》, Peking University Core Journal), 2025, Issue 6, pp. 1478-1494 (17 pages)
Funding: Supported by the General Program of the National Natural Science Foundation of China (No. 62472357) and the Young Scientists Fund, Class C (formerly the Young Scientists Fund) (No. 62102323).
Keywords: visual question answering; natural language explanation; cross-modal reasoning; pseudo labeling; pre-trained models