Journal Articles
1,673 articles found
Video action recognition meets vision-language models exploring human factors in scene interaction: a review
1
Authors: GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie 《Optoelectronics Letters》 2025, No. 10, pp. 626-640 (15 pages)
Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level ones based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of “Factor” to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
Keywords: human factors; video action recognition; vision-language models; spatiotemporal granularity; multimodal alignment; scene interaction
CIT-Rec: Enhancing Sequential Recommendation System with Large Language Models
2
Authors: Ziyu Li, Zhen Chen, Xuejing Fu, Tong Mo, Weiping Li 《Computers, Materials & Continua》 2026, No. 3, pp. 2328-2343 (16 pages)
Recommendation systems are key to boosting user engagement, satisfaction, and retention, particularly on media platforms where personalized content is vital. Sequential recommendation systems learn from user-item interactions to predict future items of interest. However, many current methods rely on unique user and item IDs, limiting their ability to represent users and items effectively, especially in zero-shot learning scenarios where training data is scarce. With the rapid development of Large Language Models (LLMs), researchers are exploring their potential to enhance recommendation systems. However, there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems, where items are typically indexed by IDs. Moreover, most research focuses on item representations, neglecting personalized user modeling. To address these issues, we propose a sequential recommendation framework using LLMs, called CIT-Rec, a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations. Specifically, by aligning intuitive image information with text containing semantic features, we can more accurately represent items, improving item representation quality. We focus not only on item representations but also on user representations. To more precisely capture users' personalized preferences, we use traditional sequential recommendation models to train on users' historical interaction data, effectively capturing behavioral patterns. Finally, by combining LLMs and traditional sequential recommendation models, we allow the LLM to understand linguistic semantics while capturing collaborative semantics. Extensive evaluations on real-world datasets show that our model outperforms baseline methods, effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
Keywords: large language models; vision-language models; sequential recommendation; instruction tuning
A versatile framework for analyzing galaxy image data by incorporating Human-in-the-loop in a large vision model
3
Authors: Ming-Xiang Fu, Yu Song, Jia-Meng Lv, Liang Cao, Peng Jia, Nan Li, Xiang-Ru Li, Ji-Feng Liu, A-Li Luo, Bo Qiu, Shi-Yin Shen, Liang-Ping Tu, Li-Li Wang, Shou-Lin Wei, Hai-Feng Yang, Zhen-Ping Yi, Zhi-Qiang Zou 《Chinese Physics C》 SCIE CAS CSCD 2024, No. 9, pp. 176-187 (12 pages)
The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. In response, astronomers are turning to deep learning techniques, but these methods are limited by their specific training sets, leading to considerable duplicate workloads. To overcome this issue, we built a framework for the general analysis of galaxy images based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratios of galaxy images and the imbalanced distribution of galaxy categories, we designed our LVM to incorporate a human-in-the-loop (HITL) module, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability for all the above-mentioned tasks on galaxy images in the DESI Legacy Imaging Surveys. In particular, for the object detection task, which was trained using 1000 data points, our DST in the LVM achieved an accuracy of 96.7%, while ResNet50 plus Mask R-CNN reached an accuracy of 93.1%. For morphological classification, to obtain an area under the curve (AUC) of ~0.9, LVM plus DST and HITL only required 1/50 of the training sets that ResNet18 required. In addition, multimodal data can be integrated, which creates possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-messenger astronomy.
Keywords: artificial intelligence; large vision model; human-in-the-loop; astronomy; galaxies
Remaining useful life prediction of gears based on residual-attention TCN and vision transformer
4
Authors: 胡爱军, 李晨阳, 邢磊, 周卓浩, 向玲 《航空动力学报》 (PKU Core) 2025, No. 12, pp. 14-24 (11 pages)
The operating condition of a gear system is affected by multiple factors that exhibit long-term temporal dependencies and differ between local and global features. To effectively capture the temporal dependencies in the data and adaptively adjust attention to features, a temporal convolutional network with a residual convolutional block attention mechanism (RCMTCN) is proposed. By introducing residual connections into the convolutional block attention module, the model can attend to both the original input and the attention-weighted information, improving its perception of local information. On this basis, a vision transformer (ViT) model is combined with RCMTCN to predict the remaining useful life (RUL) of gears; the ViT model effectively captures the global information in the data. The fusion of the two fully exploits their complementary strengths in local feature extraction and global information attention for time-series data, improving the perception of multi-dimensional features. Finally, the model is validated on gear degradation datasets under two operating conditions: pitting-fault data are used for training, and pitting and tooth-breakage faults are tested separately. Experimental results show that, compared with other methods, the proposed method extracts key feature information more fully, achieving scoring-function scores of 0.8898 on pitting faults and 0.8587 on tooth-breakage faults, demonstrating good adaptability across operating conditions and fault types.
Keywords: gear; remaining useful life; temporal network; attention mechanism; vision transformer model
Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
5
Authors: Jieyu An, Wan Mohd Nazmee Wan Zainon, Binfen Ding 《Intelligent Automation & Soft Computing》 SCIE 2023, No. 8, pp. 1673-1689 (17 pages)
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and it necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
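As an illustration of the contrastive objective this abstract describes, the sketch below implements a generic symmetric image-text contrastive (InfoNCE-style) loss in plain Python. It is not the paper's exact formulation; the function name and temperature value are assumptions for the example.

```python
import math

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """InfoNCE-style loss over matched image-text pairs.

    image_embs, text_embs: lists of unit-normalized vectors, where index i
    in both lists is a matched pair. Matched pairs are pulled together,
    mismatched pairs pushed apart, in both directions.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(image_embs)
    # Pairwise similarity matrix, scaled by the temperature.
    sims = [[dot(img, txt) / temperature for txt in text_embs]
            for img in image_embs]

    def nll(row, target):
        # Numerically stable cross-entropy of one logit row vs. the match.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    # Image-to-text and text-to-image directions, averaged.
    i2t = sum(nll(sims[i], i) for i in range(n)) / n
    t2i = sum(nll([sims[j][i] for j in range(n)], i) for i in range(n)) / n
    return (i2t + t2i) / 2
```

With perfectly aligned pairs the loss is near zero; with shuffled pairs it grows, which is what drives the representations of matched image-text pairs together during training.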
Keywords: multimodal sentiment analysis; vision-language pre-trained model; contrastive learning; sentiment classification
Application and evaluation of the Vision-LSTM model in ultrasound diagnosis of Thyroid Imaging Reporting and Data System category 4b thyroid nodules
6
Authors: 张鑫茹, 李扬, 孙萌, 聂玮, 马喆 《山东大学学报(医学版)》 (PKU Core) 2025, No. 11, pp. 68-74 (7 pages)
Objective: To investigate the diagnostic accuracy of Vision-LSTM-based artificial intelligence (AI) for Thyroid Imaging Reporting and Data System category 4b (TI-RADS 4b) thyroid nodules on ultrasound, and to evaluate its feasibility for assisting clinical decision-making. Methods: Ultrasound images of 401 TI-RADS 4b thyroid nodules from our hospital were collected and used to train and validate a Vision-LSTM model. The model's diagnoses were compared with those of junior and senior physicians in terms of accuracy and stability; performance was quantified with metrics including the area under the curve (AUC) and precision-recall (PR) curves. Results: In independent validation, the Vision-LSTM model's AUC (0.88) and accuracy (89.4%) were significantly higher than those of junior physicians (AUC: 0.624) and comparable to those of senior physicians (AUC: 0.787), demonstrating its potential for diagnostic assistance. The AI model accurately identified complex features in ultrasound images and produced consistent diagnoses, showing high accuracy and reliability. Conclusion: Vision-LSTM-based AI can significantly improve the efficiency and accuracy of diagnosing TI-RADS 4b thyroid nodules, providing effective assistance to physicians and reducing their workload.
Keywords: Thyroid Imaging Reporting and Data System; thyroid nodule; Vision-LSTM model; diagnostic accuracy; artificial intelligence
Modeling of a Linear Scanning 3D Vision Coordinate Measurement System
7
Authors: 孙玉芹, 黄庆成, 车仁生 《Journal of Harbin Institute of Technology (New Series)》 EI CAS 1998, No. 3, pp. 32-35 (4 pages)
This paper theoretically analyzes and researches the coordinate frames of a 3D vision scanning system, establishes the mathematical model of the system's scanning process, and derives the relationship between the general non-orthonormal sensor coordinate system and the machine coordinate system, as well as the coordinate transformation matrix for the extrinsic calibration of the system.
Keywords: structured light; laser stripe sensor; 3D vision; CMM; mathematical model; extrinsic calibration
Structured scene modeling using micro stereo vision system with large field of view
8
Authors: 颜世莹, 朱玉文, 刘佳音, 贾云得 《Journal of Harbin Institute of Technology (New Series)》 EI CAS 2001, No. 3, pp. 296-299 (4 pages)
This paper presents a method for structured scene modeling using a micro stereo vision system with a large field of view. The proposed algorithm includes edge detection with the Canny detector, line fitting with a principal-axis-based approach, finding corresponding lines using a feature-based matching method, and 3D line depth computation.
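The "principal-axis-based" line fitting named in this abstract can be sketched as a total-least-squares fit: the line passes through the centroid of the edge points along the direction of their largest second moment. A minimal version, with the function name and return convention chosen for illustration rather than taken from the paper:

```python
import math

def principal_axis_line(points):
    """Fit a line to 2D points via the principal axis (total least squares).

    Returns (centroid, direction): the fitted line passes through the
    centroid of the points along the unit eigenvector associated with the
    largest eigenvalue of the 2x2 scatter (covariance) matrix.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Central second moments (unnormalized covariance entries).
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Orientation of the major axis: theta = 0.5 * atan2(2*sxy, sxx - syy).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))
```

Unlike ordinary least squares, this fit minimizes perpendicular distances, so it behaves the same for near-vertical edge segments as for near-horizontal ones.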
Keywords: structured scene modeling; stereo vision; wide field of view; mobile robot
Special Topic on Security of Large Models
9
Authors: SU Zhou, DU Linkang 《ZTE Communications》 2025, No. 3, pp. 1-2 (2 pages)
Large models, such as large language models (LLMs), vision-language models (VLMs), and multimodal agents, have become key elements in artificial intelligence (AI) systems. Their rapid development has greatly improved perception, generation, and decision-making in various fields. However, their vast scale and complexity bring about new security challenges. Issues such as backdoor vulnerabilities during training, jailbreaking in multimodal reasoning, and data provenance and copyright auditing have made security a critical focus for both academia and industry.
Keywords: large models; security; multimodal agents; large language models (LLMs); vision-language models; data provenance; copyright auditing; backdoor vulnerabilities
Foundation models: Insights and implications for gastrointestinal cancer
10
Authors: Lei Shi, Rui Huang, Li-Ling Zhao, An-Jie Guo 《World Journal of Gastroenterology》 2025, No. 47, pp. 7-34 (28 pages)
Gastrointestinal (GI) cancers represent a major global health concern due to their high incidence and mortality rates. Foundation models (FMs), also referred to as large models, represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges. These models encompass large language models (LLMs), vision FMs (VFMs), and multimodal LLMs (MLLMs), all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization. This review delineates the principal applications of these models: LLMs facilitate the structuring of clinical narratives, extraction of insights from medical records, and enhancement of physician-patient communication; VFMs are employed in the analysis of endoscopic, radiological, and pathological images for lesion detection and staging; MLLMs integrate heterogeneous data modalities, including imaging, textual information, and genomic data, to support diagnostic processes, treatment prediction, and prognostic evaluation. Despite these promising developments, several challenges remain, such as the need for data standardization, limited diversity within training datasets, substantial computational resource requirements, and ethical-legal concerns. In conclusion, FMs exhibit significant potential to advance research and clinical management of GI cancers. Future research efforts should prioritize the refinement of these models, promote international collaborations, and adopt interdisciplinary approaches. Such a comprehensive strategy is essential to fully harness the capabilities of FMs, driving substantial progress in the fight against GI malignancies.
Keywords: foundation models; gastrointestinal cancers; large language models; vision foundation models; multimodal large language models
High-precision copper-grade identification via a vision transformer with PGNAA
11
Authors: Jie Cao, Chong-Gui Zhong, Han-Ting You, Yan Zhang, Ren-Bo Wang, Shu-Min Zhou, Jin-Hui Qu, Rui Chen, Shi-Liang Liu 《Nuclear Science and Techniques》 2025, No. 7, pp. 89-99 (11 pages)
The identification of ore grades is a critical step in mineral resource exploration and mining. Prompt gamma neutron activation analysis (PGNAA) technology employs gamma rays generated by the nuclear reactions between neutrons and samples to achieve the qualitative and quantitative detection of sample components. In this study, we present a novel method for identifying copper grade by combining the vision transformer (ViT) model with the PGNAA technique. First, a Monte Carlo simulation is employed to determine the optimal sizes of the neutron moderator and thermal neutron absorption material and the dimensions of the device. Subsequently, based on the parameters obtained through optimization, a PGNAA copper ore measurement model is established. The gamma spectrum of the copper ore is analyzed using the ViT model, whose hyperparameters are optimized with a grid search. To ensure the reliability of the identification results, the test results are obtained through five repeated tenfold cross-validations. Long short-term memory and convolutional neural network models are compared with the ViT method. The results indicate that the ViT method is efficient in identifying copper ore grades, with average accuracy, precision, recall, F1 score, and F1(-) score values of 0.9795, 0.9637, 0.9614, 0.9625, and 0.9942, respectively. When identifying associated minerals, the ViT model can identify Pb, Zn, Fe, and Co minerals with identification accuracies of 0.9215, 0.9396, 0.9966, and 0.8311, respectively.
Keywords: copper-grade identification; vision transformer model; prompt gamma neutron activation analysis; Monte Carlo N-particle
Local Geomagnetic Component Modeling of Auroral Images Based on Local-Global Feature
12
Authors: WANG Bo, ZHANG Yuanshu, CHENG Wei, TIAN Xinqin, SHENG Qinghong, LI Jun, LING Xiao, LIU Xiang 《Transactions of Nanjing University of Aeronautics and Astronautics》 2025, No. 6, pp. 710-727 (18 pages)
Accurately predicting the geomagnetic field is of great significance for space environment monitoring and space weather forecasting worldwide. This paper proposes a vision Transformer (ViT) hybrid model that leverages aurora images to predict local geomagnetic station components, breaking the spatial limitations of geomagnetic stations. Our method utilizes the ViT backbone model in combination with convolutional networks to capture both the large-scale spatial correlation and the distinct local feature correlation between aurora images and geomagnetic station data. Essentially, the model comprises a visual geometry group (VGG) image feature extraction network, a ViT-based encoder network, and a regression prediction network. Our experimental findings indicate that global features of aurora images play a more substantial role in predicting geomagnetic data than local features. Specifically, the hybrid model achieves a 39.1% reduction in root mean square error compared to the VGG model, a 29.5% reduction compared to the ViT model, and a 35.3% reduction relative to the residual network (ResNet) model. Moreover, the fitting accuracy of the model surpasses that of the VGG, ViT, and ResNet models by 2.14%, 1.58%, and 4.1%, respectively.
Keywords: ultraviolet aurora image; geomagnetic field prediction; vision Transformer (ViT) hybrid model
A leather defect classification method based on millimeter-wave sensing
13
Authors: 张健, 关灏文 《小型微型计算机系统》 (PKU Core) 2026, No. 2, pp. 257-264 (8 pages)
Leather defect classification is a key step in ensuring the quality of leather products. Traditional manual inspection and image-processing methods are limited by environmental factors such as lighting and cannot meet the demands of efficient inspection. In recent years, deep learning, especially convolutional neural networks (CNNs), has improved the accuracy and efficiency of defect detection, but it remains affected by the environment. Millimeter-wave radar, an emerging non-destructive testing technique, has attracted growing attention for its strong penetration and its immunity to lighting and similar factors. This paper proposes a leather defect classification method that combines millimeter-wave radar with an improved Vision Transformer model: time-frequency features of leather defects are extracted from millimeter-wave radar signals and classified with the deep learning model, achieving an accuracy of 95.62% on a self-built dataset, a significant advantage over classic classification models.
Keywords: millimeter-wave radar; leather defect classification; Vision Transformer model; transfer learning
A survey of knowledge distillation for vision algorithms
14
Authors: 潘海为, 于丰铭, 张可佳, 兰海燕, 孟庆宇, 李哲 《计算机研究与发展》 (PKU Core) 2026, No. 1, pp. 90-122 (33 pages)
Knowledge distillation, a key technique in deep learning, compresses and accelerates models by transferring the knowledge of a large teacher model to a smaller student model. While preserving performance, it significantly reduces computational and storage requirements, facilitating the deployment of high-performance models on resource-constrained edge devices. This paper systematically surveys recent advances in knowledge distillation, classifying methods from two perspectives: knowledge type and teacher-student architecture. It summarizes distillation methods for three typical knowledge types, namely output-feature knowledge, intermediate-feature knowledge, and relational-feature knowledge, as well as methods across architectures: convolutional to convolutional, convolutional to ViT (vision Transformer), ViT to convolutional, and ViT to ViT. It discusses learning schemes including offline distillation, online distillation, self-distillation, data-free distillation, multi-teacher distillation, and assistant distillation; reviews optimization methods based on the distillation process, knowledge structure, temperature coefficient, and loss function; analyzes improvements to distillation from adversarial techniques, automated machine learning, reinforcement learning, and diffusion models; and summarizes implementations of distillation in common applications. Despite significant progress, knowledge distillation still faces many challenges in practical applications and theoretical research. Finally, these problems are analyzed in depth, and insights into future directions are offered.
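The output-feature ("soft target") distillation summarized above is conventionally trained with a temperature-softened KL-divergence term plus a hard-label cross-entropy. The sketch below shows that standard loss in plain Python; the weighting `alpha` and temperature are illustrative defaults, not values from the survey:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits softened by a temperature coefficient."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Output-feature ("soft target") distillation loss.

    Weighted sum of (a) cross-entropy against the hard label and (b) KL
    divergence between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes comparable.
    """
    hard = -math.log(softmax(student_logits)[label])
    pt = softmax(teacher_logits, temperature)
    ps = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(pt, ps))
    return alpha * hard + (1 - alpha) * (temperature ** 2) * kl
```

A higher temperature spreads the teacher's probability mass over non-target classes, exposing the "dark knowledge" (inter-class similarity) that the student is asked to imitate.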
Keywords: knowledge distillation; model compression; deep learning; convolutional neural network; vision Transformer
Binocular Vision Object Positioning Method for Robots Based on Coarse-fine Stereo Matching (cited 6 times)
15
Authors: Wei-Ping Ma, Wen-Xin Li, Peng-Xia Cao 《International Journal of Automation and Computing》 EI CSCD 2020, No. 4, pp. 562-571 (10 pages)
In order to improve the low positioning accuracy and execution efficiency of robot binocular vision, a binocular vision positioning method based on coarse-fine stereo matching is proposed to achieve object positioning. The random fern is used in the coarse matching to identify objects in the left and right images, and the pixel coordinates of the object center points in the two images are calculated to complete the center matching. In the fine matching, the right center point is viewed as an estimated value to set the search range in the right image, in which region matching is implemented to find the best match of the left center point. Then, the similar-triangle principle of the binocular vision model is used to calculate the 3D coordinates of the center point, achieving fast and accurate object positioning. Finally, the proposed method is applied to object scene images and a robotic-arm grasping platform. The experimental results show that the average absolute positioning error and average relative positioning error of the proposed method are 8.22 mm and 1.96%, respectively, when the object's depth distance is within 600 mm, and the time consumption is less than 1.029 s. The method can meet the needs of a robot grasping system and has good accuracy and robustness.
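The similar-triangle computation mentioned in this abstract can be sketched for a rectified stereo pair: depth is focal length times baseline over disparity, and the lateral coordinates follow from the pinhole model. The function name and all parameter values here are illustrative, not taken from the paper:

```python
def triangulate(u_left, u_right, v, focal_px, baseline_mm, cx, cy):
    """Recover a 3D point from a matched pixel pair in a rectified
    binocular rig using the similar-triangle principle.

    u_left, u_right: column of the matched point in each image (pixels);
    v: shared row (rectified images); focal_px: focal length in pixels;
    baseline_mm: distance between the camera centers; (cx, cy): principal
    point. Returns (X, Y, Z) in millimeters, Z along the optical axis.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("matched point must have positive disparity")
    z = focal_px * baseline_mm / disparity   # depth from similar triangles
    x = (u_left - cx) * z / focal_px         # lateral offset, pinhole model
    y = (v - cy) * z / focal_px
    return x, y, z
```

Because depth is inversely proportional to disparity, depth error grows quadratically with distance, which is consistent with the paper restricting its accuracy claim to objects within 600 mm.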
Keywords: object positioning; stereo matching; random fern; normalized cross correlation; binocular vision model
Machine vision measurement of dam surface settlement based on natural landmarks
16
Authors: 王堃, 李登华, 丁勇 《水电能源科学》 (PKU Core) 2026, No. 1, pp. 193-197 (5 pages)
To address the difficulty that traditional machine vision techniques require dedicated targets to measure dam surface settlement in complex terrain, this paper proposes a machine vision measurement technique that uses the natural texture of objects already on the dam surface or natural features (such as bedrock and hillsides) as landmarks. A fixed-point-region image registration method with a simplified projection model is developed: deriving and simplifying the projection model broadens the applicable scenarios and allows images to be effectively rectified. The speeded-up robust features (SURF) algorithm is incorporated to extract features of the natural landmarks and precisely match image feature points, from which the average pixel displacement of each landmark is computed; using the ratio of pixel distance to physical distance, the real displacement of the landmark is obtained. Experimental results show that within a monitoring distance of 70 m the measurement accuracy is within ±1.00 mm, effectively solving the problem of low-cost, wide-area monitoring of dam surface settlement; the method can be applied in practical engineering.
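The final pixel-to-physical conversion described above (average pixel displacement of the matched feature points times a known scale ratio) reduces to a few lines; the function name, reference-distance convention, and sample values below are assumptions for illustration only:

```python
def settlement_mm(pixel_displacements, reference_px, reference_mm):
    """Convert tracked landmark motion in pixels to physical displacement.

    pixel_displacements: per-feature-point vertical displacements (pixels)
    of the landmark between two epochs, as given by feature matching;
    reference_px / reference_mm: the same distance measured in pixels and
    in millimeters in the image plane, defining the scale ratio.
    """
    scale = reference_mm / reference_px                      # mm per pixel
    mean_px = sum(pixel_displacements) / len(pixel_displacements)
    return mean_px * scale
```

Averaging over many matched feature points is what suppresses individual matching outliers and lets the technique reach sub-millimeter-class precision without a dedicated target.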
Keywords: machine vision; projection model; dam surface settlement; displacement measurement
Semantics-driven 4D radar and camera fusion for object detection
17
Authors: 郑联庆, 艾文瑾, 马志雄, 任洪泽, 卢守义, 刘瑞, 闫晟煜, 朱西产, 白傑 《汽车工程》 (PKU Core) 2026, No. 2, pp. 342-351 (10 pages)
Fusing cameras with 4D radar for robust 3D object detection is crucial to the safety of autonomous driving. However, existing fusion methods focus mainly on aligning low-dimensional radar geometric features with image pixel features and fail to exploit scene-level semantic information, leading to suboptimal detection performance. This paper proposes RCT-Net, the first vision-language model (VLM)-assisted 4D radar and camera fusion framework for 3D object detection. First, carefully designed user prompts guide the VLM to generate textual scene descriptions containing the objects of interest, from which a text encoder produces scene-level semantic features. A TBFusion (Text-BEV Fusion) module is then designed, which deeply integrates the scene semantic features into bird's-eye-view (BEV) space through a novel cross-modal attention mechanism. This module both provides prior knowledge to guide the view transformation of image features and further semantically enhances the multimodal BEV features in the final fusion stage. Finally, a 3D detection head decodes the enhanced features to predict object attributes. Extensive experiments on the public 4D radar datasets TJ4DRadSet and View-of-Delft show that RCT-Net achieves excellent performance, with 3D mAP of 41.34% and 57.02%, respectively, validating the effectiveness and advancement of the framework.
Keywords: autonomous driving; multimodal fusion; 3D object detection; 4D millimeter-wave radar; vision-language model
A survey of vision-language-action models for embodied manipulation
18
Authors: 李浩然, 陈宇辉, 崔文博, 刘卫恒, 刘锴, 周明才, 张正涛, 赵冬斌 《自动化学报》 (PKU Core) 2026, No. 1, pp. 18-51 (34 pages)
Embodied intelligence systems improve agent capability through continual interaction between the agent and its environment, and have attracted wide attention from academia and industry. Vision-language-action models, general-purpose robot control models inspired by the development of large models, enhance the agent-environment interaction capability of embodied intelligence systems and greatly expand the application scenarios of embodied robots. This paper surveys vision-language-action models for embodied manipulation. It first reviews their development history in detail, then analyzes the state of research in five areas: model architecture, training data, pre-training methods, post-training methods, and model evaluation. Finally, it summarizes the challenges facing the development and practical deployment of vision-language-action models and their possible future directions.
Keywords: embodied intelligence; vision-language-action model; robot; foundation model
An image generation augmentation method for pomelo fractal trees incorporating vision-language models
19
Authors: 赖力潜, 段洁利, 杨洲, 袁浩天 《农业机械学报》 (PKU Core) 2026, No. 1, pp. 311-318, 338 (9 pages)
To reduce the dependence of object detection for pomelos and similar fruit on large amounts of annotated data, this paper proposes an image generation augmentation method for pomelo fractal trees that incorporates vision-language models. With only 3-5 unannotated real images, the method can generate a large-scale annotated training dataset without any training. First, the text-prompt-based zero-shot segmentation model Grounded SAM (Grounded segment anything model) extracts pomelo-tree components; then the stable diffusion model Stable Diffusion generates random backgrounds from text prompts; finally, an improved fractal-tree algorithm generates pomelo trees to increase diversity and realism. Validation with a lightweight version of YOLO v10 on a self-built pomelo detection dataset in unstructured environments shows that, with 0, 8, 16, 32, and 64 real training images, the method improves the mean average precision at intersection-over-union thresholds from 0.50 to 0.95 (mAP50-95) by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8%, respectively. With 221 real and 512 generated training images, the model achieves its best performance: precision 76.9%, recall 62.7%, mAP50 70.3%, and mAP50-95 38.4%. Transferred to orange detection, the performance gains at the same data scales are 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%. With 1302 real and 512 generated training images, the model likewise reaches its best performance: precision 90.3%, recall 87.8%, mAP50 94.0%, and mAP50-95 54.0%. The results show that this image generation augmentation method effectively expands training data in zero-shot and few-shot learning scenarios, improves the detection performance of the lightweight YOLO v10, and exhibits good generalization.
Keywords: pomelo object detection; generative data augmentation; few-shot learning; vision-language model
Research status and application prospects of multimodal large models for space robots
20
Authors: 罗涛, 张亚航, 王耀兵 《航天器工程》 (PKU Core) 2026, No. 1, pp. 128-136 (9 pages)
With the rapid advance of space missions such as human spaceflight, deep-space exploration, and on-orbit servicing, the demands on intelligent space robots for high autonomy, strong robustness, and adaptability to complex environments are increasingly prominent. This paper systematically reviews the key technologies of vision-language-action models, summarizes the major research progress at home and abroad, and analyzes representative work along two dimensions: task-planning strategies and end-to-end control strategies. Combined with space-robot operation scenarios, it analyzes their outstanding application potential in environment perception, semantic understanding, task planning, and operation execution for space robots, with emphasis on the application requirements of space robots for multimodal large models. On this basis, considering the current state of China's space-robot technology, the paper proposes development strategies for multimodal large models for future intelligent space robots from the perspectives of hardware-software design, model application capability, and intelligent ecosystem construction, providing a reference for the intelligent application of space robots in complex tasks in human spaceflight, deep-space exploration, and on-orbit servicing.
Keywords: space robot; embodied intelligence; vision-language-action model