
Probe-Based Multi-Modal Explanation Method for Visual Question Answering (基于探针引导的视觉语言多模态解释方法)
Abstract (translated from the Chinese): Most existing vision-language models perform cross-modal reasoning through deep neural networks with a "black-box" structure, whose internal execution is difficult for humans to understand intuitively. This paper therefore focuses on Natural Language Explanation (NLE) methods for the Visual Question Answering (VQA) task, aiming to explain the model's reasoning process through generated natural-language statements. Although existing methods have made progress, they still face the following challenges: (1) the answer-prediction process and the explanation-generation process interfere with each other, weakening the faithfulness of explanations; (2) existing methods can generate only single-modality explanations, which suffer from semantic ambiguity caused by vague references. To this end, we propose a new multi-modal explanation method for the VQA reasoning process (Probe-based Multi-modal Explanation method, PME), which extracts information from every hidden-layer state of the reasoning process without affecting the original reasoning path, ensuring that the explanation process stays faithful to the original reasoning. In addition, we use a pseudo-labeling method to fuse the VQA-X and GQA datasets, achieving multi-modal explanation while preserving faithfulness and alleviating the referential ambiguity of single-modality text explanations. We compare PME against other state-of-the-art models on the VQA-X and A-OKVQA visual question answering datasets; experimental results show that PME achieves higher explanation evaluation scores on the corresponding test sets. We hope this work provides a new research foundation for understanding the internals of network models. Code is available at: https://github.com/LouisJacky/LAVIS_PME.

Abstract (English): Recently, vision-language models have achieved remarkable performance on various complex tasks such as Visual Question Answering (VQA), Image Captioning, and Referring Expression Comprehension (REC). However, these models mostly adopt a "black-box" structure built on deep neural networks, making their inference processes difficult to understand intuitively. To explore the internal mechanisms of neural-network-based models, researchers have proposed various interpretation methods, including gradient activation, saliency maps, and natural language explanations. Among these approaches, Natural Language Explanation (NLE) has gained significant attention in the vision-language community, particularly for VQA tasks (VQA-NLE), as it provides human-interpretable explanations that help understand model reasoning and develop more trustworthy deep learning systems. Existing VQA-NLE approaches can be categorized into two paradigms: post-hoc explanation methods and self-rationalization methods. Post-hoc methods typically employ separate decision and explanation models: the decision model first predicts an answer, and the explanation model then uses the predicted answer along with the question to infer the reasoning process. However, this sequential, separate processing leaves no direct logical connection between the explanation and reasoning processes, compromising explanation reliability. In contrast, self-rationalization methods unify answer prediction and explanation generation into a single task. Despite offering simpler implementation and more reliable modeling of the logical relationship, such methods inevitably allow mutual interference between the answer-generation and explanation processes, weakening explanation faithfulness. Through extensive experiments on the VQA-X dataset, we validate this interference: 8.38% of test samples show different answer predictions when an explanation loss is introduced during joint training, thus weakening the faithfulness of explanations. To address these challenges, we propose a novel Probe-based Multi-modal Explanation method (PME). Our method introduces a learnable probe structure that can extract information from each encoder hidden-layer state during forward propagation without interfering with the original reasoning process. Experimental results confirm that our probe-based approach maintains identical answer predictions to the original model while generating faithful explanations. Additionally, to address the semantic ambiguity inherent in single-modal text explanations, we develop a cross-generation pseudo-labeling approach that enables simultaneous generation of natural language explanations and object detection boxes. This multi-modal approach significantly improves explanation clarity by providing explicit visual grounding for textual references, improving clarity scores from 67.5% to 79.2% on VQA-X and from 58.9% to 66.2% on A-OKVQA. Extensive experiments demonstrate that our PME method outperforms state-of-the-art models on both the VQA-X and A-OKVQA benchmarks, achieving 5.5% and 2.1% improvements in CIDEr scores, respectively, over the previous best method, S3C. Both quantitative and qualitative analyses validate that our method more accurately reflects the model's reasoning process while maintaining explanation faithfulness. As a model-agnostic strategy, our approach can be readily applied to other vision-language models, providing a new paradigm for developing more reliable and comprehensive model explanation methods. Code is available at: https://github.com/LouisJacky/LAVIS_PME.
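The key property claimed for the probe is that it reads every hidden-layer state during the forward pass while leaving the original reasoning path untouched, so answer predictions are identical with or without it. The minimal sketch below illustrates that read-only tap pattern on a toy stand-in model; it is not the authors' implementation, and all class and method names here (`TinyEncoder`, `register_tap`, `Probe`) are hypothetical. In PME the probe attaches to the encoder layers of a vision-language transformer.

```python
# Toy illustration of a read-only probe: it observes each hidden state
# during forward propagation but never rewrites the computation, so the
# host model's output is unchanged. Names are illustrative only.

class TinyEncoder:
    """A stand-in 'black-box' model with three layers."""
    def __init__(self):
        self.taps = []              # registered probe callbacks

    def register_tap(self, fn):
        self.taps.append(fn)

    def forward(self, x):
        h = x
        for layer in range(3):
            h = 2 * h + 1           # the original reasoning path
            for fn in self.taps:
                fn(layer, h)        # probes only *read* h
        return h                    # answer is unaffected by taps

class Probe:
    """Collects every hidden state for downstream explanation generation."""
    def __init__(self):
        self.states = []

    def __call__(self, layer, h):
        self.states.append((layer, h))   # read-only record

model = TinyEncoder()
baseline = model.forward(5)         # answer without any probe

probe = Probe()
model.register_tap(probe)
probed = model.forward(5)           # answer with the probe attached

assert probed == baseline           # reasoning path is untouched
print(probe.states)                 # [(0, 11), (1, 23), (2, 47)]
```

In a real vision-language model the same effect is typically obtained by detaching the captured hidden states from the computation graph, so that gradients from the explanation loss cannot flow back into, and perturb, the answer-prediction pathway.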
Authors: SUO Wei (索伟); LV Jia-Qi (吕家齐); SUN Meng-Yang (孙梦阳); LIU Le (刘乐); WANG Peng (王鹏) (School of Computer Science, Northwestern Polytechnical University, Xi'an 710129; School of Cybersecurity, Northwestern Polytechnical University, Xi'an 710129)
Source: Chinese Journal of Computers (《计算机学报》, Peking University Core Journal), 2025, Issue 6, pp. 1478-1494 (17 pages)
Funding: Supported by the General Program of the National Natural Science Foundation of China (No. 62472357) and the Young Scientists Fund, Class C (formerly the Young Scientists Fund) (No. 62102323).
Keywords: visual question answering; natural language explanation; cross-modal reasoning; pseudo labeling; pre-trained models