Abstract
Existing spatial non-cooperative target datasets cover only a single modality, so they can neither effectively reflect fine-grained positional relationships nor support the training of multimodal models. This study constructs the first multimodal intent recognition dataset for spatial non-cooperative targets and explores a paradigm in which large models generate multimodal intent data for spatial targets that is then fed back to retrain the models. Based on DeepSeek-V3 and ChatGPT-4o, detailed image-text correspondence and composite intent simulation of fine-grained spatial target positional relationships were carried out through prompt design, reinforced feedback, and compliance control, yielding 11,000 pairs of noisy image-text data across 72 fine-grained scenarios. On this dataset, the multimodal model CLIP was fine-tuned and the unimodal model ResNet was trained. Experimental results show that multimodal data improves intent recognition accuracy by 4.47%. This dataset provides a foundation for training and evaluating multimodal models in spatial target intent recognition, drives models toward image-text and human-model interactive training modes, and promotes the application of large models in this field.
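The abstract states that CLIP was fine-tuned on the image-text pairs. The paper's exact training recipe is not given here; the sketch below only illustrates the standard symmetric contrastive (InfoNCE) objective that CLIP-style fine-tuning typically optimizes, using NumPy on placeholder embeddings. All names (`clip_contrastive_loss`, the temperature value, the embedding dimension) are illustrative assumptions, not details from the paper.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss, as in CLIP-style training.

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))             # matched pairs sit on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

# Placeholder batch of 8 pairs with 512-dim embeddings
rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(8, 512)),
                             rng.normal(size=(8, 512)))
print(float(loss))
```

For random embeddings the loss is close to log N; it drops toward zero as matched image-text pairs become more similar than mismatched ones, which is the behavior the fine-tuning stage exploits.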
Authors
MA Tian; SHANG Chuyang; REN Wanzhu; MA Shaofeng; ZHANG Hanwen (Xi'an University of Science and Technology, Xi'an 710054, P.R. China; Huaneng Shaanxi Power Generation Co., Ltd. New Energy Branch, Xi'an 710116, P.R. China; Huazhong University of Science and Technology, Wuhan 430074, P.R. China)
Funding
National Key Research and Development Program of China (2022ZD0119005)
2024 SASTIND Stable Support Fund Project of the National Key Laboratory of Space Intelligent Control Technology (HTKJ2024KL502027).
Keywords
space target
intention recognition
Large Language Model (LLM)
multimodal
dataset