Visual entailment (VE) is a prototypical task in multimodal visual reasoning, where current methods frequently utilize large language models (LLMs) as the knowledge base to assist in answering questions. These methods heavily rely on the textual modality, which inherently cannot capture the full extent of information contained within images. We propose a context-aware visual entailment (CAVE) model, which introduces a novel aggregation module designed to extract high-level semantic features from images. This module integrates lower-level semantic image features into high-level visual tokens, formatting them similarly to text tokens so that they can serve as inputs for LLMs. The CAVE model compensates for the loss of image information and integrates it more effectively with textual comprehension. Additionally, the CAVE model incorporates a new input format and training methodology rooted in instruction tuning and in-context learning techniques. The objective of this research is to maximize the inherent logical reasoning capabilities of LLMs. Experimental results on the e-SNLI-VE dataset show that the proposed CAVE model exhibits outstanding performance.
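The abstract's aggregation step, pooling many low-level image features into a few text-token-like visual tokens, can be illustrated with a minimal single-head cross-attention sketch. This is not the paper's actual module; the learned-query design, token count (8), and feature dimension (64) are assumptions chosen for illustration.

```python
import numpy as np

def aggregate_visual_tokens(patch_feats, queries):
    """Pool low-level patch features into a few high-level visual tokens:
    each query vector attends over all patches (single-head cross-attention)
    and returns a weighted summary shaped like a text-token embedding."""
    d = queries.shape[-1]
    scores = queries @ patch_feats.T / np.sqrt(d)   # (k, n_patches)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over patches
    return weights @ patch_feats                    # (k, d) visual tokens

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))  # e.g. a 14x14 grid of patch features
queries = rng.standard_normal((8, 64))    # 8 learned query vectors (assumption)
tokens = aggregate_visual_tokens(patches, queries)
print(tokens.shape)  # (8, 64)
```

The resulting eight vectors could then be concatenated with text-token embeddings in an LLM's input sequence, which is the "formatting them similarly to text tokens" idea the abstract describes.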
To address object-recognition errors caused by category differences between the object detector's pre-training dataset and image-captioning datasets, and the model's insufficient understanding of inter-object relations in rare scenes caused by uneven sample sizes across scenes, a multi-scale visual and textual semantic feature fusion algorithm for image captioning (MVTFF-IC) is proposed. The multi-scale visual feature fusion (MVFF) module models global, grid, and region features with a graph attention network to obtain more representative visual representations; the deep semantic fusion module (DSFM) integrates textual semantic features that encode object relations through a cross-attention mechanism to generate more accurate captions. Experimental results on the Microsoft Common Objects in Context (MSCOCO) dataset show that MVTFF-IC reaches 136.7 on the consensus-based image description evaluation metric CIDEr, outperforming many popular existing algorithms, capturing key information in images more accurately, and generating high-quality captions.
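The DSFM step described above, injecting relation-bearing textual semantics into visual features via cross-attention, can be sketched minimally as follows. This is a generic residual cross-attention fusion, not the paper's implementation; the grid size (7x7) and feature dimension (32) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, text):
    """Fuse textual semantics into visual features: each visual vector
    queries the text features, and the attended text summary is added
    back as a residual."""
    d = visual.shape[-1]
    attn = softmax(visual @ text.T / np.sqrt(d))  # (n_vis, n_txt)
    return visual + attn @ text                   # residual fusion

rng = np.random.default_rng(1)
vis = rng.standard_normal((49, 32))  # grid features (assumed 7x7 grid)
txt = rng.standard_normal((12, 32))  # semantic features for 12 text tokens
fused = cross_attention_fuse(vis, txt)
print(fused.shape)  # (49, 32)
```

In a full captioning model the fused features would feed the caption decoder; here the point is only how cross-attention lets every visual position absorb relevant textual relation information.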
The anonymity, parallel input, and automatic recording of group utterances in Group Argument Support Systems (GASS) help groups generate a large number of valuable ideas, but also frequently cause "information overload" and "knowledge gaps." This paper presents an automatic clustering tool that strengthens groups' cognitive capacity and improves the efficiency of electronic meetings. It first identifies challenges of automatic topic clustering in the GASS setting and reviews related research; then, combining GASS discussion patterns, the characteristics of discussion text, and the requirements of Chinese text analysis, it describes text-analysis techniques for Chinese word segmentation, stop-word list processing, and effective-word recognition. A feature-vector selection method based on topic analysis is proposed, and an automatic clustering tool is designed and implemented in Java following the self-organizing map (SOM) neural network approach. Experiments show that the tool achieves a clustering precision of 0.28, a clustering recall of 0.35, and a clustering error rate of 0.83.
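The SOM clustering idea underlying the tool can be sketched in a few lines. The abstract's tool is implemented in Java over Chinese discussion text; the sketch below uses Python with toy numeric vectors standing in for term-frequency features, and the grid size, epoch count, and learning-rate schedule are all illustrative assumptions.

```python
import numpy as np

def train_som(data, grid=(3, 3), epochs=50, lr0=0.5, seed=0):
    """Minimal self-organizing map: for each sample, find the best-matching
    unit (BMU) and pull the BMU and its grid neighbours toward the sample,
    with learning rate and neighbourhood radius decaying over epochs."""
    rng = np.random.default_rng(seed)
    n_units = grid[0] * grid[1]
    w = rng.standard_normal((n_units, data.shape[1]))
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        sigma = max(1.0 * (1 - t / epochs), 0.3)
        for x in data:
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-dist2 / (2 * sigma ** 2))  # neighbourhood weights
            w += lr * h[:, None] * (x - w)
    return w

def assign_clusters(data, w):
    """Each sample is assigned to its nearest map unit."""
    return np.array([np.argmin(((w - x) ** 2).sum(axis=1)) for x in data])

rng = np.random.default_rng(2)
# toy "comment" vectors drawn around three topic centres (assumption)
centres = rng.standard_normal((3, 10)) * 4
docs = np.vstack([c + rng.standard_normal((20, 10)) for c in centres])
w = train_som(docs)
labels = assign_clusters(docs, w)
print(labels.shape)  # (60,): one cluster label per document
```

In the tool described above, the input vectors would instead come from the topic-analysis-based feature selection applied to segmented Chinese discussion text.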
Funding: Fundamental Research Funds for the Central Universities, China (No. 2232021A-10); Shanghai Pujiang Program, China (No. 22PJ1423400).