Abstract
In recent years, large language models (LLMs) and multimodal large models (MLMs) have achieved remarkable success in natural language processing and multimodal content understanding. However, these general-purpose models show clear shortcomings on cultural-heritage tasks, such as biased understanding of domain-specific terminology, superficial answers caused by a lack of cultural and historical background, and knowledge hallucination, so their outputs often fail to meet practical needs. To address these challenges, this paper proposes, for the first time, a multimodal large model for the cultural heritage domain: Bogu-Wenjin. First, a semi-automated strategy is designed to construct a large-scale multimodal cultural heritage dataset and to build a multimodal knowledge graph. The constructed dataset is then used to train a general-purpose large model in two stages, image-text alignment and instruction fine-tuning, to adapt it to the specific needs of the cultural heritage domain. In addition, the knowledge graph is introduced as an auxiliary knowledge base; through graph-text retrieval and relation retrieval strategies, it effectively improves the credibility and interpretability of the model on cultural-heritage question-answering tasks. Experimental results show that Bogu-Wenjin performs excellently in artifact image description, attribute question answering, and relation question understanding. Compared with general multimodal large models, it markedly improves the understanding and answering of complex cultural content, exceeding the second-best model's composite scores by 21.4%, 53%, and 20.6% on the artifact image description, artifact attribute question, and artifact relation question tasks, respectively.
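The knowledge-graph augmentation described above can be illustrated with a minimal sketch. This is not the paper's code: the triples, entity names, and prompt format are invented examples, and the real system's retrieval strategies and schema are not specified at this level of detail. The sketch only shows the general pattern of relation retrieval followed by prompt augmentation.

```python
# Illustrative sketch of knowledge-graph relation retrieval used to
# ground a cultural-heritage question before it reaches the model.
# All triples and names below are hypothetical examples.

from typing import List, Tuple

# A toy knowledge graph stored as (head, relation, tail) triples.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Bronze Ding", "dynasty", "Shang"),
    ("Bronze Ding", "material", "bronze"),
    ("Bronze Ding", "excavated_at", "Yinxu"),
    ("Yinxu", "located_in", "Anyang"),
]

def retrieve_relations(entity: str) -> List[Tuple[str, str, str]]:
    """Return all triples in which the entity appears as head or tail."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def build_prompt(question: str, entity: str) -> str:
    """Prepend retrieved facts to the question as auxiliary context."""
    facts = "; ".join(f"{h} --{r}--> {t}" for h, r, t in retrieve_relations(entity))
    return f"Known facts: {facts}\nQuestion: {question}"

print(build_prompt("Which dynasty does this artifact date from?", "Bronze Ding"))
```

In the paper's actual pipeline, the retrieved facts would accompany the artifact image and question as input to the fine-tuned multimodal model, which is what supports the credibility and interpretability gains reported in the abstract.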
Authors
ZHAO Wanqing; XU Chaoyang; XIE Zhiwei; ZHANG Shaobo; ZHANG Xiaodan; PENG Jinye (School of Electronic Information, Northwest University, Xi'an 710127, China; Shaanxi Key Laboratory of Higher Education Institution of Generative Artificial Intelligence and Mixed Reality, Xi'an 710127, China)
Source
《西北大学学报(自然科学版)》
Peking University Core Journal (北大核心)
2025, No. 6, pp. 1267-1284 (18 pages)
Journal of Northwest University(Natural Science Edition)
Funding
National Key R&D Program of China (2024YFF0907600)
National Natural Science Foundation of China (62273275)
Natural Science Basic Research Program of Shaanxi Province, Youth Project (2025JC-YBQN-847)
Keywords
multimodal large models
cultural heritage
visual question answering
knowledge graphs
model fine-tuning
knowledge enhancement