Medical visual question answering (MedVQA) faces unique challenges due to the high precision required for images and the specialized nature of the questions. These challenges include insufficient feature extraction capabilities, a lack of textual priors, and incomplete information fusion and interaction. This paper proposes an enhanced bootstrapping language-image pre-training (BLIP) model for MedVQA based on multimodal feature augmentation and triple-path collaborative attention (FCA-BLIP) to address these issues. First, FCA-BLIP employs a unified bootstrapped multimodal architecture that integrates ResNet and bidirectional encoder representations from Transformers (BERT) models to enhance feature extraction, enabling a more precise analysis of the details in images and questions. Next, the pre-trained BLIP model is used to extract features from image-text sample pairs, allowing the model to capture the semantic relationships and shared information between images and text. Finally, a novel attention structure is developed to fuse the multimodal feature vectors, thereby improving the alignment accuracy between modalities. Experimental results demonstrate that the proposed method performs well in clinical visual question-answering tasks. For the MedVQA task of staging diabetic macular edema in fundus imaging, the proposed method outperforms existing major models on several performance metrics.
With the continued rise and rapid development of machine learning, and of deep learning in particular, research on visual question answering (VQA) has made significant progress, with important theoretical significance and practical application value. It is therefore worth summarizing the current research to provide a reference for researchers in this field. This article presents a detailed and in-depth analysis and summary of relevant research and typical methods in visual question answering. First, relevant background knowledge about VQA is introduced. Second, the issues and challenges of visual question answering are discussed, along with a discussion of promising methodologies. Third, the key sub-problems affecting visual question answering are summarized and analyzed. The commonly used datasets and evaluation metrics are then summarized. Next, popular algorithms and models in VQA research are compared and listed. Finally, future development trends of visual question answering are discussed and conclusions drawn.
The original intention of visual question answering (VQA) models is to infer the answer from the information in the visual image relevant to the question text, but many VQA models yield answers that are biased by prior knowledge, especially language priors. This paper proposes a mitigation model called language priors mitigation-VQA (LPM-VQA) for the language-priors problem in VQA, which divides language priors into positive and negative priors. Different network branches capture and process the two kinds of priors to mitigate their effect. A dynamically changing language-prior feedback objective function is designed using the intermediate results of some modules in the VQA model. The weight of the loss value for each answer is set dynamically according to the strength of its language priors, balancing its proportion in the total VQA loss to further mitigate the priors. This model does not depend on a particular baseline VQA architecture and can be configured as a plug-in to improve the performance of most existing VQA models. The experimental results show that the proposed model is general and effective, achieving state-of-the-art accuracy on the VQA-CP v2 dataset.
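The abstract above describes weighting each answer's loss by the strength of its language prior. A minimal numpy sketch of that weighting idea follows; the function name, the frequency-based proxy for prior strength, and the `alpha` hyperparameter are our own assumptions, not LPM-VQA's exact formulation.

```python
import numpy as np

def prior_weighted_loss(logits, answer_ids, prior_counts, alpha=1.0):
    """Cross-entropy over candidate answers, with each answer's loss
    down-weighted in proportion to the strength of its language prior
    (approximated here by its frequency among training answers).

    logits:       (batch, num_answers) model scores
    answer_ids:   (batch,) index of the ground-truth answer
    prior_counts: (num_answers,) answer frequencies in the train split
    """
    # Softmax over candidate answers (numerically stabilized)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    # Stronger prior (more frequent answer) -> smaller loss weight
    freq = prior_counts / prior_counts.sum()
    weights = 1.0 / (1.0 + alpha * freq * len(prior_counts))

    nll = -np.log(probs[np.arange(len(answer_ids)), answer_ids] + 1e-12)
    return float((weights[answer_ids] * nll).mean())
```

In this sketch, a frequent answer ("yes" in many yes/no datasets) contributes less to the total loss than a rare one, which is the balancing effect the paper aims for.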
Visual question answering (VQA) has attracted increasing attention in computer vision and natural language processing. Scholars are committed to studying how to better integrate image features and text features to achieve better results in VQA tasks. Analyzing all features may cause information redundancy and a heavy computational burden, and attention mechanisms are a sensible way to address this problem. However, a single attention mechanism may attend to features incompletely. This paper improves on this by proposing a hybrid attention mechanism that combines spatial attention and channel attention. Because an attention mechanism can cause the loss of original features, a small portion of the image features is added back as compensation. For the attention over text features, a self-attention mechanism is introduced, strengthening the internal structural features of sentences to improve the overall model. The results show that the attention mechanism and feature compensation add 6.1% accuracy to a multimodal low-rank bilinear pooling network.
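The combination of channel attention, spatial attention, and feature compensation described above can be sketched as follows. This is a simplified illustration with our own assumed gating (sigmoid over mean-pooled features) and compensation ratio, not the paper's exact architecture.

```python
import numpy as np

def hybrid_attention(feat, comp_ratio=0.1):
    """Channel attention followed by spatial attention over a feature
    map, re-injecting a small fraction of the original features as
    compensation for information lost to the gating.

    feat: (C, H, W) image feature map
    """
    # Channel attention: squeeze spatial dims, sigmoid-gate each channel
    chan = feat.mean(axis=(1, 2))                 # (C,)
    chan_gate = 1.0 / (1.0 + np.exp(-chan))
    out = feat * chan_gate[:, None, None]

    # Spatial attention: squeeze channels, sigmoid-gate each location
    spat = out.mean(axis=0)                       # (H, W)
    spat_gate = 1.0 / (1.0 + np.exp(-spat))
    out = out * spat_gate[None, :, :]

    # Feature compensation: add back part of the original features
    return out + comp_ratio * feat
```

Applying the two gates in sequence lets the model emphasize both "which channels" and "where", while the compensation term keeps a residual path to the raw features.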
Background External knowledge representations play an essential role in knowledge-based visual question answering, enabling better understanding of complex scenarios in the open world. Recent entity-relationship embedding approaches are deficient in representing some complex relations, resulting in a lack of topic-related knowledge and redundancy in topic-irrelevant information. Methods To this end, we propose MKEAH: Multimodal Knowledge Extraction and Accumulation on Hyperplanes. To ensure that the lengths of the feature vectors projected onto the hyperplane are comparable and to filter out topic-irrelevant information, two losses are proposed to learn the triplet representations from complementary views: a range loss and an orthogonal loss. To quantify the ability to extract topic-related knowledge, we present the Topic Similarity (TS) between topics and entity relations. Results Experimental results demonstrate the effectiveness of hyperplane embedding for knowledge representation in knowledge-based visual question answering. Our model outperformed state-of-the-art methods by 2.12% and 3.24% on two challenging knowledge-request datasets, OK-VQA and KRVQA, respectively. Conclusions The clear advantages of our model in TS show that using hyperplane embedding to represent multimodal knowledge improves the ability to extract topic-related knowledge.
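Hyperplane embedding of (head, relation, tail) triplets, as used above, typically builds on a TransH-style projection: entities are projected onto a relation-specific hyperplane before the translation distance is scored. A minimal sketch of that projection and scoring follows; MKEAH's specific range and orthogonal losses are not reproduced here.

```python
import numpy as np

def project_to_hyperplane(v, w):
    """Project vector v onto the hyperplane whose unit normal is w."""
    w = w / np.linalg.norm(w)
    return v - np.dot(v, w) * w

def triplet_score(h, r, t, w):
    """Score a (head, relation, tail) triplet by the translation
    distance ||h_p + r - t_p|| after projecting head and tail onto
    the relation's hyperplane (lower score = more plausible)."""
    h_p = project_to_hyperplane(h, w)
    t_p = project_to_hyperplane(t, w)
    return float(np.linalg.norm(h_p + r - t_p))
```

Projecting onto a per-relation hyperplane is what lets one entity participate in different relations with different effective representations.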
Purposes: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology. Methods: In this cross-sectional study, ophthalmic image posts and associated captions published between Jan 1, 2016, and Dec 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate six VLMs: GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B-Instruct, Janus-Pro-7B, InternVL3-8B, and HealthGPT-L14. The primary outcome was overall accuracy; secondary outcomes included subset-, subspeciality-, and modality-specific accuracy. Performance on open-ended questions was also quantified using language-based metrics, including AlignScore, BARTScore, BERTScore, BLEU, CIDEr, METEOR, and ROUGE_L. Error types in open-ended responses were manually analyzed through stratified sampling. Results: OphthalWeChat included 3469 images and 30120 QA pairs covering 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.555), significantly outperforming GPT-4o (0.527), Qwen2.5-VL-72B-Instruct (0.520), HealthGPT-L14 (0.502), InternVL3-8B (0.453), and Janus-Pro-7B (0.333) (all P<0.001). It also led in both the Chinese (0.551) and English (0.559) subsets. By subset, Gemini 2.0 Flash excelled in Binary_CN (0.687) and Single-choice_CN (0.666); HealthGPT-L14 performed best in Single-choice_EN (0.739); and GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (0.254), and Open-ended_EN (0.271). Language-based metrics showed rankings inconsistent with accuracy in the open-ended subsets. Performance varied across subspecialties and modalities, with Gemini 2.0 Flash leading in 6 of 9 subspecialties and 11 of the top 15 imaging modalities. Error-type analysis revealed lesion/diagnosis errors as the most frequent (35.6%-50.6%), followed by anatomical-location errors (28.3%-37.5%). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset enables quantitative evaluation of VLMs, supporting the development of accurate and specialized AI systems for eye care.
Knowledge-based visual question answering (KB-VQA), which requires external world knowledge beyond the image for reasoning, is more challenging than traditional visual question answering. Recent works have demonstrated the effectiveness of using a large (vision) language model as an implicit knowledge source to acquire the necessary information. However, the knowledge stored in large models (LMs) is often coarse-grained and inaccurate, causing questions that require finer-grained information to be answered incorrectly. In this work, we propose a variational expectation-maximization (EM) framework that bootstraps the VQA performance of LMs with their own answers. In contrast to former VQA pipelines, we treat the outside knowledge as a latent variable. In the E-step, we approximate the posterior with two components: first, a rough answer, e.g., a general description of the image, which is usually the strength of LMs, and second, a multi-modal neural retriever that retrieves question-specific knowledge from an external knowledge base. In the M-step, the training objective optimizes the ability of the original LMs to generate rough answers as well as refined answers based on the retrieved information. Extensive experiments show that our proposed framework, BootLM, has strong retrieval ability and achieves state-of-the-art performance on knowledge-based VQA tasks.
Visual Question Answering (VQA) is a complex task that requires a deep understanding of both visual content and natural language questions. The challenge lies in enabling models to recognize and interpret visual elements and to reason through questions in a multi-step, compositional manner. We propose a novel Transformer-based model that introduces specialized tokenization techniques to effectively capture intricate relationships between visual and textual features. The model employs an enhanced self-attention mechanism, enabling it to attend to multiple modalities simultaneously, while a co-attention unit dynamically guides focus to the most relevant image regions and question components. Additionally, a multi-step reasoning module supports iterative inference, allowing the model to excel at complex reasoning tasks. Extensive experiments on benchmark datasets demonstrate the model's superior performance, with accuracies of 98.6% on CLEVR, 63.78% on GQA, and 68.67% on VQA v2.0. Ablation studies confirm the critical contribution of key components, such as the reasoning module and co-attention mechanism, to the model's effectiveness. Qualitative analysis of the learned attention distributions further illustrates the model's dynamic reasoning process, which adapts to task complexity. Overall, our study advances the adaptation of Transformer architectures for VQA, enhancing both reasoning capabilities and model interpretability in visual reasoning tasks.
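The co-attention unit mentioned above, which guides focus to relevant image regions for each question component, commonly reduces to an affinity matrix between question tokens and image regions. A minimal numpy sketch of that pattern follows; the exact unit in this paper is not specified, so this is the generic form, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(img_feats, q_feats):
    """Question-guided co-attention: a scaled affinity matrix between
    question tokens and image regions yields a distribution over
    regions per token; attended image features are returned per token.

    img_feats: (regions, d) region features
    q_feats:   (tokens, d) question token features
    """
    scale = np.sqrt(img_feats.shape[1])
    affinity = q_feats @ img_feats.T / scale     # (tokens, regions)
    attn = softmax(affinity, axis=1)
    return attn @ img_feats                      # (tokens, d)
```

Each question token thus receives a weighted summary of the image, which downstream reasoning modules can consume step by step.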
Medical visual question answering (Med-VQA) is a task that aims to answer clinical questions given a medical image. Existing literature generally treats it as a classic classification task based on interaction features of the image and question. However, such a paradigm ignores the valuable semantics of the candidate answers as well as their relations. From a real-world dataset, we observe that: 1) the text of candidate answers has a strong intrinsic correlation with medical images; 2) subtle differences among multiple candidate answers are crucial for identifying the correct one. Therefore, we propose an answer semantics enhanced (ASE) method to integrate the semantics of answers and capture their subtle differences. Specifically, we enhance the semantic correlation of image-question-answer triplets by aligning images and question-answer tuples within the feature fusion module. Then, we devise a contrastive learning loss to highlight the semantic differences between the correct answer and the other answers. Finally, extensive experiments demonstrate the effectiveness of our method.
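A contrastive loss that pulls the fused image-question feature toward the correct answer's embedding and away from the other candidates, as described above, can be sketched in InfoNCE form. The function name, cosine-similarity choice, and temperature are our own assumptions about a plausible instantiation, not the paper's exact loss.

```python
import numpy as np

def answer_contrastive_loss(fused, answer_embs, correct_idx, tau=0.1):
    """InfoNCE-style loss: the fused image-question feature should be
    more similar to the correct answer's text embedding than to the
    embeddings of the other candidate answers.

    fused:       (d,) fused image-question feature
    answer_embs: (num_answers, d) candidate-answer text embeddings
    correct_idx: index of the correct answer
    """
    # Cosine similarities between the fused feature and each answer
    f = fused / np.linalg.norm(fused)
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    sims = a @ f / tau

    # Negative log-probability of the correct answer under softmax
    z = sims - sims.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[correct_idx])
```

Minimizing this loss sharpens exactly the subtle inter-answer differences the abstract identifies as crucial.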
Knowledge-based Visual Question Answering (VQA) is a challenging task that requires models to access external knowledge for reasoning. Large Language Models (LLMs) have recently been employed for zero-shot knowledge-based VQA due to their inherent knowledge storage and in-context learning capabilities. However, LLMs are commonly used only as implicit knowledge bases, and their generative and in-context learning potential remains underutilized. Existing works demonstrate that the performance of in-context learning strongly depends on the quality and order of the demonstrations in prompts. In light of this, we propose Knowledge Generation with Frozen Language Models (KGFLM), a novel method for generating explicit knowledge statements to improve zero-shot knowledge-based VQA. Our knowledge generation strategy aims to identify effective demonstrations and determine their optimal order, thereby prompting the frozen LLM to produce more useful knowledge statements for better predictions. The generated knowledge statements can also serve as interpretable rationales. In our method, the selection and arrangement of demonstrations are based on the semantic similarity and quality of the demonstrations for each question, without requiring additional annotations. Furthermore, a series of experiments is conducted on the A-OKVQA and OKVQA datasets. The results show that our method outperforms strong zero-shot knowledge-based VQA methods.
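Selecting and ordering demonstrations by semantic similarity and quality, as described above, can be sketched as a top-k retrieval step. The scoring rule below (cosine similarity with a small quality tiebreaker, most relevant demonstration placed last, nearest the query) is our own plausible reading, not KGFLM's published procedure.

```python
import numpy as np

def select_demonstrations(q_emb, demo_embs, demo_quality, k=3):
    """Pick the k demonstrations most similar to the question and
    order them ascending by score, so the most relevant one appears
    last in the prompt (closest to the query).

    q_emb:        (d,) question embedding
    demo_embs:    (n, d) candidate demonstration embeddings
    demo_quality: (n,) per-demonstration quality score (tiebreaker)
    """
    q = q_emb / np.linalg.norm(q_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ q

    score = sims + 1e-3 * demo_quality
    top = np.argsort(score)[-k:]   # ascending: best demonstration last
    return top.tolist()
```

The returned indices can then be used to assemble the prompt that elicits knowledge statements from the frozen LLM.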
Previous works employ Large Language Models (LLMs) such as GPT-3 for knowledge-based Visual Question Answering (VQA). We argue that the inferential capacity of an LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they have limitations, such as the possibility of failing to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled "Prompting Large Language Models with Knowledge-Injection" (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt an LLM for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach improves accuracy by over 1.3 and 1.7 points on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.
With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question given an image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it remains relatively unexplored, and very nontrivial, to exploit dynamics in the transformers of VQA tasks through all stages in an end-to-end manner. Typically, due to the large computation cost of transformers, researchers are inclined to apply transformers only to the extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer into the transformer, as it can effectively increase model capacity and requires fewer transformer layers for the VQA task. In particular, we name the dynamics in the Transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel model, a mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the Transformer, convolutional neural networks (CNNs), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
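One simple way to make multi-head self-attention conditional on the question, in the spirit of the cMHSA block above, is to gate each head's output with a sigmoid of a question projection. The sketch below is a minimal illustration under our own assumptions (identity Q/K/V projections, per-head scalar gates); the paper's actual cMHSA block is not specified at this level of detail.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_mhsa(x, q_emb, W_gate, num_heads=4):
    """Multi-head self-attention whose heads are gated by the question:
    a sigmoid over a projection of the question embedding scales each
    head's output before the heads are concatenated.

    x:      (seq, d) token features, d divisible by num_heads
    q_emb:  (dq,) question embedding
    W_gate: (dq, num_heads) projection producing per-head gates
    """
    seq, d = x.shape
    dh = d // num_heads
    gates = 1.0 / (1.0 + np.exp(-(q_emb @ W_gate)))   # (num_heads,)

    heads = []
    for h in range(num_heads):
        xh = x[:, h * dh:(h + 1) * dh]   # identity Q=K=V proj for brevity
        attn = softmax(xh @ xh.T / np.sqrt(dh))
        heads.append(gates[h] * (attn @ xh))
    return np.concatenate(heads, axis=1)              # (seq, d)
```

Because the gates depend on the question, different questions effectively activate different subsets of heads, which is the "dynamic capacity" argument the abstract makes.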
Visual Question Answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on Convolutional Neural Networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations through two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
With the advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal Consistency Alignment Module and the Query-Augmentation Side Network, specifically optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks demonstrate that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters. Additionally, it achieves superior performance compared to fully fine-tuned models on certain benchmarks.
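The abstract's "Vector-based Cross-modal Random Matrix Adaptation" is not detailed, but the name suggests a VeRA-style scheme: frozen, shared random projection matrices whose contribution is modulated only by small trainable scaling vectors. The sketch below illustrates that general idea under our own assumptions; it is not UniTrans's actual module.

```python
import numpy as np

def vera_delta(x, A, B, d_vec, b_vec):
    """VeRA-style adapted forward pass: the weight update is
    diag(b) @ B @ diag(d) @ A, where A and B are frozen random
    matrices (sharable across layers) and only the small vectors
    d and b are trained.

    x:     (batch, d_in) inputs
    A:     (r, d_in) frozen random down-projection
    B:     (d_out, r) frozen random up-projection
    d_vec: (r,) trainable rank-wise scaling
    b_vec: (d_out,) trainable output-wise scaling
    """
    h = x @ A.T            # (batch, r): project with frozen A
    h = h * d_vec          # trainable rank-wise scaling
    out = h @ B.T          # (batch, d_out): project with frozen B
    return out * b_vec     # trainable output-wise scaling
```

Since only `d_vec` and `b_vec` are trained per layer, the trainable-parameter count is tiny relative to LoRA-style updates, matching the abstract's emphasis on minimal parameter overhead.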
Visual Question Answering (VQA) has sparked widespread interest as a crucial task in integrating vision and language. VQA primarily uses attention mechanisms to associate relevant visual regions with the input question in order to answer it effectively. Detection-based features extracted by an object detection network capture the visual attention distribution over predetermined detection boxes and provide object-level insights that answer questions about foreground objects effectively. However, they cannot answer questions about background regions that lack detection boxes, because they miss the fine-grained details that are the strength of grid-based features. In this paper, we propose a Dual-Level Feature Embedding (DLFE) network, which effectively integrates grid-based and detection-based image features in a unified architecture to realize the complementary advantages of both. Specifically, DLFE first proposes a novel Dual-Level Self-Attention (DLSA) module to mine the intrinsic properties of the two feature types, in which Positional Relation Attention (PRA) is designed to model position information. We then propose Feature Fusion Attention (FFA) to address the semantic noise caused by fusing the two feature types, constructing an alignment graph to enhance and align the grid and detection features. Finally, we use co-attention to learn the interactive features of the image and question and to answer questions more accurately. Our method significantly improves on the baseline, increasing accuracy from 66.01% to 70.63% on the test-std set of VQA 1.0 and from 66.24% to 70.91% on the test-std set of VQA 2.0.
Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ···} ∈ the "color" subspace yet cube ∈ "shape". In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings, increasing overall answering accuracy by a relative 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.
Funding (FCA-BLIP study): Supported by the Program for Liaoning Excellent Talents in University (No. LR15045) and the Liaoning Provincial Science and Technology Department Applied Basic Research Plan (No. 101300243).
Funding (VQA survey): Project (61702063) supported by the National Natural Science Foundation of China.
Funding (hybrid attention study): This work was supported by the Sichuan Science and Technology Program (2021YFQ0003).
Funding: Supported by the National Natural Science Foundation of China (61976160, 61906137, 61976158, 62076184, 62076182) and the Shanghai Science and Technology Plan Project (21DZ1204800).
Abstract: Background: External knowledge representations play an essential role in knowledge-based visual question answering for understanding complex scenarios in the open world. Recent entity-relationship embedding approaches are deficient in representing some complex relations, resulting in a lack of topic-related knowledge and redundancy in topic-irrelevant information. Methods: To this end, we propose MKEAH: Multimodal Knowledge Extraction and Accumulation on Hyperplanes. To ensure that the lengths of the feature vectors projected onto the hyperplane compare equally, and to filter out topic-irrelevant information, two losses are proposed to learn the triplet representations from complementary views: a range loss and an orthogonal loss. To interpret the capability of extracting topic-related knowledge, we present the Topic Similarity (TS) between topics and entity-relations. Results: Experimental results demonstrate the effectiveness of hyperplane embedding for knowledge representation in knowledge-based visual question answering. Our model outperformed state-of-the-art methods by 2.12% and 3.24% on two challenging knowledge-request datasets, OK-VQA and KRVQA, respectively. Conclusions: The clear advantage of our model in TS shows that using hyperplane embedding to represent multimodal knowledge improves its ability to extract topic-related knowledge.
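Hyperplane embedding for triplets can be illustrated with the standard TransH-style projection, where head and tail entities are projected onto a relation-specific hyperplane before a translational comparison. Whether MKEAH uses exactly this scoring form is an assumption; the sketch only shows the projection mechanics.

```python
import numpy as np

def project_to_hyperplane(e, w):
    """Project an entity embedding e onto the hyperplane with normal w
    (TransH-style), so comparisons happen within the relation's hyperplane."""
    w = w / np.linalg.norm(w)
    return e - np.dot(e, w) * w

def triplet_score(head, rel, tail, w):
    """Translational plausibility score ||h_perp + r - t_perp||; lower is
    better. The form follows TransH; MKEAH's range and orthogonal losses
    are built on top of such projected representations."""
    h_p = project_to_hyperplane(head, w)
    t_p = project_to_hyperplane(tail, w)
    return np.linalg.norm(h_p + rel - t_p)

w = np.array([0.0, 0.0, 1.0])       # hyperplane normal for one relation
h = np.array([1.0, 2.0, 5.0])
t = np.array([2.0, 3.0, -4.0])
r = np.array([1.0, 1.0, 0.0])
# components along w are removed, so the score ignores the third axis
score = triplet_score(h, r, t, w)
```

Projection filters out the component of each entity along the relation's normal, which is how topic-irrelevant information gets discarded per relation.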
基金supported by the Start-up Fund for RAPs under the Strategic Hiring Scheme(P0048623)from HKSARthe Global STEM Professorship Scheme(P0046113)and Henry G.Leong Endowed Professorship in Elderly Vision Health.
Abstract: Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology. Methods: In this cross-sectional study, ophthalmic image posts and associated captions published between Jan 1, 2016, and Dec 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate six VLMs: GPT-4o, Gemini 2.0 Flash, Qwen2.5-VL-72B-Instruct, Janus-Pro-7B, InternVL3-8B, and HealthGPT-L14. The primary outcome was overall accuracy; secondary outcomes included subset-, subspeciality-, and modality-specific accuracy. Performance on open-ended questions was also quantified using language-based metrics, including AlignScore, BARTScore, BERTScore, BLEU, CIDEr, METEOR, and ROUGE_L. Error types in open-ended responses were manually analyzed through stratified sampling. Results: OphthalWeChat included 3469 images and 30120 QA pairs covering 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.555), significantly outperforming GPT-4o (0.527), Qwen2.5-VL-72B-Instruct (0.520), HealthGPT-L14 (0.502), InternVL3-8B (0.453), and Janus-Pro-7B (0.333) (all P<0.001). It also led in both the Chinese (0.551) and English (0.559) subsets. By subset, Gemini 2.0 Flash excelled in Binary_CN (0.687) and Single-choice_CN (0.666); HealthGPT-L14 performed best in Single-choice_EN (0.739); and GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (0.254), and Open-ended_EN (0.271). Language-based metrics showed inconsistent rankings relative to accuracy on the open-ended subsets. Performance varied across subspecialties and modalities, with Gemini 2.0 Flash leading in 6 of 9 subspecialties and 11 of the top-15 imaging modalities. Error-type analysis revealed lesion/diagnosis errors as the most frequent (35.6%-50.6%), followed by anatomical location errors (28.3%-37.5%). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset enables quantitative evaluation of VLMs, supporting the development of accurate and specialized AI systems for eye care.
Abstract: Knowledge-based visual question answering (KB-VQA), which requires external world knowledge beyond the image for reasoning, is more challenging than traditional visual question answering. Recent works have demonstrated the effectiveness of using a large (vision) language model as an implicit knowledge source to acquire the necessary information. However, the knowledge stored in large models (LMs) is often coarse-grained and inaccurate, causing questions that require finer-grained information to be answered incorrectly. In this work, we propose a variational expectation-maximization (EM) framework that bootstraps the VQA performance of LMs with their own answers. In contrast to former VQA pipelines, we treat the outside knowledge as a latent variable. In the E-step, we approximate the posterior with two components: first, a rough answer, e.g., a general description of the image, which is usually the strength of LMs; and second, a multimodal neural retriever that retrieves question-specific knowledge from an external knowledge base. In the M-step, the training objective optimizes the ability of the original LMs to generate rough answers as well as refined answers based on the retrieved information. Extensive experiments show that our proposed framework, BootLM, has strong retrieval ability and achieves state-of-the-art performance on knowledge-based VQA tasks.
Abstract: Visual question answering (VQA) is a complex task that requires a deep understanding of both visual content and natural language questions. The challenge lies in enabling models to recognize and interpret visual elements and to reason through questions in a multi-step, compositional manner. We propose a novel Transformer-based model that introduces specialized tokenization techniques to effectively capture intricate relationships between visual and textual features. The model employs an enhanced self-attention mechanism, enabling it to attend to multiple modalities simultaneously, while a co-attention unit dynamically guides focus to the most relevant image regions and question components. Additionally, a multi-step reasoning module supports iterative inference, allowing the model to excel at complex reasoning tasks. Extensive experiments on benchmark datasets demonstrate the model's superior performance, with accuracies of 98.6% on CLEVR, 63.78% on GQA, and 68.67% on VQA v2.0. Ablation studies confirm the critical contribution of key components, such as the reasoning module and the co-attention mechanism, to the model's effectiveness. Qualitative analysis of the learned attention distributions further illustrates the model's dynamic reasoning process, which adapts to task complexity. Overall, our study advances the adaptation of Transformer architectures for VQA, enhancing both reasoning capabilities and model interpretability in visual reasoning tasks.
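The core of a co-attention unit like the one described is a bidirectional affinity between question tokens and image regions. Below is a minimal sketch of that interaction; the paper's unit additionally includes learned projections and the multi-step reasoning loop, which are omitted here as assumptions about scope.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(img, qst):
    """One co-attention step between modalities.

    img: (R, D) region features; qst: (T, D) question token features.
    Each token attends over regions and each region attends over tokens,
    via a shared token-region affinity matrix."""
    affinity = qst @ img.T                        # (T, R) affinity scores
    img_attended = softmax(affinity, -1) @ img    # (T, D) image view per token
    qst_attended = softmax(affinity.T, -1) @ qst  # (R, D) question view per region
    return img_attended, qst_attended

rng = np.random.default_rng(1)
regions = rng.normal(size=(5, 8))   # 5 image regions, dim 8
tokens = rng.normal(size=(3, 8))    # 3 question tokens, dim 8
img_view, qst_view = co_attention(regions, tokens)
```

A multi-step reasoning module would iterate this step, feeding the attended summaries back as the next round's queries.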
Funding: Supported by the National Natural Science Foundation of China (Nos. 62032013 and 62102074) and the Science and Technology Projects in Liaoning Province, China (No. 2023JH3/10200005).
Abstract: Medical visual question answering (Med-VQA) aims to answer clinical questions given a medical image. Existing literature generally treats it as a classic classification task based on the interaction features of the image and question. However, this paradigm ignores the valuable semantics of candidate answers as well as their relations. From real-world datasets, we observe that: 1) the text of candidate answers has a strong intrinsic correlation with medical images; and 2) subtle differences among multiple candidate answers are crucial for identifying the correct one. Therefore, we propose an answer semantics enhanced (ASE) method to integrate the semantics of answers and capture their subtle differences. Specifically, we enhance the semantic correlation of image-question-answer triplets by aligning images and question-answer tuples within the feature fusion module. We then devise a contrastive learning loss to highlight the semantic differences between the correct answer and the other answers. Finally, extensive experiments demonstrate the effectiveness of our method.
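A contrastive loss over candidate answers, as described above, is typically an InfoNCE-style objective: pull the fused image-question feature toward the correct answer embedding and push it away from the others. The sketch below uses that standard form; ASE's exact loss may differ in its temperature and negatives.

```python
import numpy as np

def answer_contrastive_loss(fusion, answers, correct_idx, tau=0.1):
    """InfoNCE-style loss over candidate answers.

    fusion: (D,) fused image-question feature;
    answers: (N, D) candidate answer embeddings;
    correct_idx: index of the ground-truth answer."""
    # cosine similarity between the fused feature and each candidate
    f = fusion / np.linalg.norm(fusion)
    a = answers / np.linalg.norm(answers, axis=1, keepdims=True)
    sims = a @ f / tau
    # log-softmax over candidates; maximize probability of the correct one
    z = sims - sims.max()
    log_prob = z - np.log(np.exp(z).sum())
    return -log_prob[correct_idx]

fusion = np.array([1.0, 0.0])
answers = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_correct = answer_contrastive_loss(fusion, answers, 0)
loss_wrong = answer_contrastive_loss(fusion, answers, 1)
```

Because all candidates share the softmax denominator, even subtle similarity gaps between near-duplicate answers are amplified at low temperature.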
Funding: Supported by the National Natural Science Foundation of China (No. 62271125).
Abstract: Knowledge-based visual question answering (VQA) is a challenging task that requires models to access external knowledge for reasoning. Large language models (LLMs) have recently been employed for zero-shot knowledge-based VQA due to their inherent knowledge storage and in-context learning capabilities. However, LLMs are commonly treated as implicit knowledge bases, and their generative and in-context learning potential remains underutilized. Existing works demonstrate that the performance of in-context learning depends strongly on the quality and order of the demonstrations in the prompt. In light of this, we propose Knowledge Generation with Frozen Language Models (KGFLM), a novel method for generating explicit knowledge statements to improve zero-shot knowledge-based VQA. Our knowledge generation strategy identifies effective demonstrations and determines their optimal order, activating the frozen LLM to produce more useful knowledge statements for better predictions. The generated knowledge statements can also serve as interpretable rationales. In our method, the selection and arrangement of demonstrations are based on the semantic similarity and quality of the demonstrations for each question, without requiring additional annotations. Experiments on the A-OKVQA and OK-VQA datasets show that our method outperforms several strong zero-shot knowledge-based VQA methods.
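Similarity-plus-quality demonstration selection can be sketched as below. The equal-weight combination of cosine similarity and a quality score, and the least-to-most-similar ordering, are assumptions for illustration; KGFLM's exact scoring and ordering rules are not specified here.

```python
import numpy as np

def select_demonstrations(query_emb, demo_embs, demo_quality, k=2):
    """Pick k in-context demonstrations for a question.

    Scores each demonstration by cosine similarity to the query plus a
    per-demonstration quality term, then returns the top-k ordered from
    lowest to highest score, so the most relevant example sits closest
    to the query in the prompt."""
    q = query_emb / np.linalg.norm(query_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ q
    score = sims + demo_quality          # equal-weight combination (assumed)
    top = np.argsort(score)[-k:]         # k best, ascending by score
    return [int(i) for i in top]

query = np.array([1.0, 0.0])
demos = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
quality = np.zeros(3)                    # neutral quality for the demo
order = select_demonstrations(query, demos, quality, k=2)
```

The selected demonstrations would then be concatenated, in this order, ahead of the question to prompt the frozen LLM for knowledge statements.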
Funding: Supported by the National Natural Science Foundation of China (No. 62272100), the Consulting Project of the Chinese Academy of Engineering (No. 2023-XY-09), and the Fundamental Research Funds for the Central Universities.
Abstract: Previous works employ large language models (LLMs) such as GPT-3 for knowledge-based visual question answering (VQA). We argue that the inferential capacity of an LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they have limitations, such as the possibility of failing to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled "Prompting Large Language Models with Knowledge-Injection" (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt the LLM itself for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach improves accuracy by over 1.3 and 1.7 points on two knowledge-based VQA datasets, OK-VQA and A-OKVQA, respectively.
基金supported in part by the National Key Research and Development Program of China,No.2021ZD0112400National Natural Science Foundation of China,No.U1908214+5 种基金Program for Innovative Research Team at the University of Liaoning Province,No.LT2020015the Support Plan for Key Field Innovation Team of Dalian,No.2021RT06the Support Plan for Leading Innovation Team of Dalian University,No.XLJ202010Program for the Liaoning Province Doctoral Research Starting Fund,No.2022-BS-336Key Laboratory of Advanced Design and Intelligent Computing(Dalian University),and Ministry of Education,No.ADIC2022003Interdisciplinary Project of Dalian University,No.DLUXK-2023-QN-015.
Abstract: With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enable precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding the textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62176061 and the Science and Technology Commission of Shanghai Municipality under Grant No. 22511105000.
Abstract: As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways to improve the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it is relatively untouched and highly nontrivial to exploit dynamics in the transformers of VQA tasks through all stages in an end-to-end manner. Typically, due to the large computational cost of transformers, researchers tend to apply transformers only to the extracted high-level visual features for downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer into the transformer, as it effectively increases model capacity and requires fewer transformer layers for the VQA task. In particular, we name the dynamics in the Transformer the conditional multi-head self-attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus a novel mixture of conditional gating blocks (McG) is proposed for VQA, which keeps the best of the Transformer, convolutional neural networks (CNNs), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
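One plausible reading of question-conditioned self-attention is that a question-derived vector gates each attention head. The sketch below implements that reading with per-head sigmoid gates and omits the learned Q/K/V projections for brevity; it is an assumption about cMHSA's mechanism, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_mhsa(x, q_gate_logits):
    """Self-attention whose heads are gated by a question vector.

    x: (T, D) token features; q_gate_logits: (n_heads,) question-derived
    logits, one per head. Each head's output is scaled by sigmoid(logit),
    so the question decides which heads contribute."""
    n_heads = q_gate_logits.shape[0]
    T, D = x.shape
    dh = D // n_heads
    gates = 1.0 / (1.0 + np.exp(-q_gate_logits))  # (n_heads,) in (0, 1)
    out = np.zeros_like(x)
    for h in range(n_heads):
        xs = x[:, h * dh:(h + 1) * dh]            # this head's feature slice
        attn = softmax(xs @ xs.T / np.sqrt(dh), -1)
        out[:, h * dh:(h + 1) * dh] = gates[h] * (attn @ xs)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
# first head effectively off, second head fully on
out = conditional_mhsa(x, np.array([-50.0, 50.0]))
```

A cResNeXt block would apply the same conditional gating to grouped convolution branches instead of attention heads, which is what makes the two compatible within McG.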
Funding: Supported by the Major Scientific and Technological Projects of China National Petroleum Corporation (CNPC) (No. ZD2019-183-001) and the Fundamental Research Funds for the Central Universities (Nos. 22CX01001A-1 and 22CX01001A-3).
Abstract: Visual question answering (VQA) has garnered significant attention as a domain that requires the synthesis of visual and textual information to produce accurate responses. While existing methods often rely on convolutional neural networks (CNNs) for feature extraction and attention mechanisms for embedding learning, they frequently fail to capture the nuanced interactions between entities within images, leading to potential ambiguities in answer generation. In this paper, we introduce a novel network architecture, Dual-modality Integration Attention with Graph-based Feature Extraction (DIAGFE), which addresses these limitations through two key innovations: a Graph-based Feature Extraction (GFE) module that enhances the precision of visual semantics extraction, and a Dual-modality Integration Attention (DIA) mechanism that efficiently fuses visual and question features to guide the model towards more accurate answer generation. Our model is trained with a composite loss function to refine its predictive accuracy. Rigorous experiments on the VQA 2.0 dataset demonstrate that DIAGFE outperforms existing methods, underscoring the effectiveness of our approach in advancing VQA research and its potential for cross-modal understanding.
Abstract: With advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages vector-based cross-modal random matrix adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: a Multimodal Consistency Alignment Module and a Query-Augmentation Side Network, specifically optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks demonstrate that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters. Additionally, it achieves superior performance compared to fully fine-tuned models on certain benchmarks.
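Vector-based random matrix adaptation can be sketched with the VeRA formulation: a pair of frozen random matrices is shared across layers, and only two small scaling vectors are trained per layer. Whether UniTrans uses exactly this shape is an assumption; the sketch shows why the trainable parameter count stays tiny.

```python
import numpy as np

def vera_delta(x, A, B, d, b):
    """Adapter update delta = b * (B @ (d * (A @ x))).

    A: (r, D_in) and B: (D_out, r) are frozen random matrices;
    d: (r,) and b: (D_out,) are the only trainable parameters."""
    return b * (B @ (d * (A @ x)))

rng = np.random.default_rng(0)
D_in, D_out, r = 6, 4, 2
A = rng.normal(size=(r, D_in))    # frozen, shared
B = rng.normal(size=(D_out, r))   # frozen, shared
d = np.ones(r)                    # trainable: r parameters
b = np.ones(D_out)                # trainable: D_out parameters
x = rng.normal(size=D_in)
delta = vera_delta(x, A, B, d, b)
# trainable count is r + D_out per layer, independent of D_in * D_out
n_trainable = d.size + b.size
```

A full LoRA-style adapter would train all of A and B (r * (D_in + D_out) parameters); here only the diagonal rescalings are learned, which is how sub-5% parameter budgets become possible.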
Abstract: Visual question answering (VQA) has sparked widespread interest as a crucial task in integrating vision and language. VQA primarily uses attention mechanisms to associate relevant visual regions with input questions and answer them effectively. Detection-based features extracted by an object detection network capture the visual attention distribution over predetermined detection boxes and provide object-level insights for answering questions about foreground objects. However, they cannot answer questions about background content that lacks detection boxes, because they miss fine-grained details, which is precisely the advantage of grid-based features. In this paper, we propose a Dual-Level Feature Embedding (DLFE) network, which effectively integrates grid-based and detection-based image features in a unified architecture to realize the complementary advantages of both. Specifically, in DLFE, a novel Dual-Level Self-Attention (DLSA) module is first proposed to mine the intrinsic properties of the two feature types, where Positional Relation Attention (PRA) is designed to model position information. We then propose a Feature Fusion Attention (FFA) to address the semantic noise caused by fusing the two features, and construct an alignment graph to enhance and align the grid and detection features. Finally, we use co-attention to learn the interactive features of the image and question and answer questions more accurately. Our method improves significantly over the baseline, increasing accuracy from 66.01% to 70.63% on the test-std split of VQA 1.0 and from 66.24% to 70.91% on the test-std split of VQA 2.0.
Funding: Supported in part by the Australian Research Council (ARC) (Nos. FL-170100117, DP-180103424, IC-190100031, and LE-200100049).
Abstract: Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., {red, blue, ···} ∈ the "color" subspace, yet cube ∈ "shape". In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). Using only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, quasi-center visual concept clustering and superordinate shortcut learning schemes are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings, increasing the overall answering accuracy relatively by 7.5% for reasoning with perturbations and 15.6% for compositional generalization tests.