Journal Articles
1,718 articles found
Development and Application of a Large Vision Model for Railway Industry
1
Authors: DAI Mingrui, LI Wenhao, SHI Weifeng, LI Guohua, YANG Taocun, DU Wenran, SHEN Meiying (Translated). Chinese Railways, 2025, No. 2, pp. 3-15 (13 pages)
Vision applications in the railway sector often face challenges such as complex and dynamic scenarios, coupled with a limited number of effective samples. Designing small models for individual scenarios is not only time-consuming and resource-intensive but also fails to meet diverse business needs. Therefore, developing large vision models specifically for the railway industry is of critical importance. This paper examines and explores potential application scenarios for large vision models within the railway sector, proposing a solution for their development. The research builds upon the UPerNet network, utilizing InternImage to replace the original backbone network, thereby enhancing the model's ability to capture details of image targets. To further improve model robustness, Semantic-Aware Normalization (SAN) and Semantic-Aware Whitening (SAW) attention mechanisms are introduced in place of the original pyramid pooling module. Additionally, the integration of spatial attention and channel attention replaces the original decoding part, allowing for dynamic adjustments to attention across various regions. Finally, datasets for railway scenarios were established through semi-automatic annotation. The experimental results indicate that the improved UPerNet_InternImage large vision model proposed for the railway industry has the potential to enhance segmentation accuracy and robustness. The model exhibits faster convergence and improved effectiveness when tackling segmentation tasks in specific railway scenarios. It offers new insights and methodologies for addressing issues prevalent in railway vision scenarios.
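The decoder change described in this abstract, channel attention combined with spatial attention, follows a common gating pattern. As a minimal illustration of that pattern only, not the authors' implementation, a CBAM-style sketch in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Gate each channel by a sigmoid of its global average. x: (C, H, W)."""
    weights = sigmoid(x.mean(axis=(1, 2)))          # (C,)
    return x * weights[:, None, None]

def spatial_attention(x):
    """Gate each spatial location by a sigmoid of its channel average."""
    weights = sigmoid(x.mean(axis=0))               # (H, W)
    return x * weights[None, :, :]

feat = np.random.default_rng(0).normal(size=(8, 4, 4))
out = spatial_attention(channel_attention(feat))    # sequential gating
print(out.shape)  # (8, 4, 4)
```

Real implementations learn the gating weights (e.g., small convolutions or MLPs over pooled statistics) rather than gating on raw pooled activations; the point here is only the two-stage, region-wise reweighting of features.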
Keywords: artificial intelligence, deformable convolution, attention mechanism, semantic segmentation, large vision model, large models for railway industry
Performance Comparison of Vision Mamba Models for Species Classification in Fisheries Monitoring
2
Authors: ZHANG Zehai, HUANG Xiaoshuang, KONG Xianghong, LIU Bilin, CHEN Xinjun. Journal of Shanghai Ocean University (PKU Core), 2026, No. 2, pp. 508-519 (12 pages)
Electronic monitoring is an important means of intelligent fisheries supervision, and image recognition is one of its key supporting technologies; deploying high-performance yet lightweight models in edge-computing scenarios remains a challenge. This study introduces the Vision Mamba (ViM) model from deep learning, which uses a selective state space model (SSM) to build a bidirectional encoder, achieving global modeling of long-range dependencies in images while maintaining linear computational complexity. Based on The Nature Conservancy fisheries monitoring dataset, a systematic performance comparison was conducted against mainstream models such as ResNet, EfficientNet, and DeiT. The results show that ViM excels in both efficiency and accuracy. Among lightweight models, ViM-Tiny improves accuracy by 1.12% and F1 score by 2.19% over the ResNet-18 baseline with 44.28% fewer parameters. Among mid-size models, ViM-Small achieves accuracy (0.9603) and F1 score (0.9645) nearly on par with the ResNet-101 baseline with 44.65% fewer parameters. The study shows that ViM retains strong fishery species classification capability while substantially reducing model complexity, striking a good balance between lightweight design and high accuracy, and offers a new technical path for building efficient, intelligent fisheries supervision systems.
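The state space mechanism this abstract refers to builds on a linear recurrence, h_t = A h_{t-1} + B x_t with output y_t = C h_t, scanned over the token sequence in linear time. A toy NumPy sketch with fixed matrices (Vision Mamba makes A, B, C input-dependent and scans bidirectionally, neither of which is shown here):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t.
    x: (T, d_in); A: (n, n); B: (n, d_in); C: (d_out, n)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:               # one step per token: cost is linear in T
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, d_in, n, d_out = 6, 3, 4, 2
y = ssm_scan(rng.normal(size=(T, d_in)),
             0.9 * np.eye(n),                # stable state transition
             rng.normal(size=(n, d_in)),
             rng.normal(size=(d_out, n)))
print(y.shape)  # (6, 2)
```

The linear-time scan, versus the quadratic cost of self-attention over T tokens, is the efficiency argument behind ViM-style encoders.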
Keywords: fisheries electronic monitoring, image classification, Vision Mamba model, deep learning, fisheries monitoring dataset
Video action recognition meets vision-language models exploring human factors in scene interaction: a review
3
Authors: GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie. Optoelectronics Letters, 2025, No. 10, pp. 626-640 (15 pages)
Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level ones based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of "Factor" to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
Keywords: human factors, video action recognition, vision-language models, spatiotemporal granularity, multimodal alignment, scene interaction
CIT-Rec: Enhancing Sequential Recommendation System with Large Language Models
4
Authors: Ziyu Li, Zhen Chen, Xuejing Fu, Tong Mo, Weiping Li. Computers, Materials & Continua, 2026, No. 3, pp. 2328-2343 (16 pages)
Recommendation systems are key to boosting user engagement, satisfaction, and retention, particularly on media platforms where personalized content is vital. Sequential recommendation systems learn from user-item interactions to predict future items of interest. However, many current methods rely on unique user and item IDs, limiting their ability to represent users and items effectively, especially in zero-shot learning scenarios where training data is scarce. With the rapid development of Large Language Models (LLMs), researchers are exploring their potential to enhance recommendation systems. However, there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems, where items are typically indexed by IDs. Moreover, most research focuses on item representations, neglecting personalized user modeling. To address these issues, we propose a sequential recommendation framework using LLMs, called CIT-Rec, a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations. Specifically, by aligning intuitive image information with text containing semantic features, we can more accurately represent items, improving item representation quality. We focus not only on item representations but also on user representations. To more precisely capture users' personalized preferences, we use traditional sequential recommendation models to train on users' historical interaction data, effectively capturing behavioral patterns. Finally, by combining LLMs and traditional sequential recommendation models, we allow the LLM to understand linguistic semantics while capturing collaborative semantics. Extensive evaluations on real-world datasets show that our model outperforms baseline methods, effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
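The item-representation idea in this abstract, fusing an item's image and text features into one vector and ranking items by similarity to a user vector, can be sketched generically. The embedding sizes and toy data below are illustrative assumptions, not CIT-Rec's actual pipeline:

```python
import numpy as np

def item_embedding(img_vec, txt_vec):
    """Fuse an item's image and text embeddings into one L2-normalized vector."""
    v = np.concatenate([img_vec, txt_vec])
    return v / np.linalg.norm(v)

def recommend(user_vec, items, k=2):
    """Rank item indices by cosine similarity to the user's preference vector."""
    scores = [float(v @ user_vec) for v in items]
    return sorted(range(len(items)), key=lambda i: -scores[i])[:k]

rng = np.random.default_rng(1)
items = [item_embedding(rng.normal(size=4), rng.normal(size=4)) for _ in range(5)]
user = items[3] + 0.05 * rng.normal(size=8)   # user history close to item 3
user /= np.linalg.norm(user)
print(recommend(user, items)[0])  # 3
```

In the paper's framing, the user vector would instead come from a trained sequential model over interaction history, and the fused item vectors would be aligned with the LLM's semantic space rather than simply concatenated.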
Keywords: large language models, vision-language models, sequential recommendation, instruction tuning
A versatile framework for analyzing galaxy image data by incorporating Human-in-the-loop in a large vision model
5
Authors: Ming-Xiang Fu, Yu Song, Jia-Meng Lv, Liang Cao, Peng Jia, Nan Li, Xiang-Ru Li, Ji-Feng Liu, A-Li Luo, Bo Qiu, Shi-Yin Shen, Liang-Ping Tu, Li-Li Wang, Shou-Lin Wei, Hai-Feng Yang, Zhen-Ping Yi, Zhi-Qiang Zou. Chinese Physics C (SCIE, CAS, CSCD), 2024, No. 9, pp. 176-187 (12 pages)
The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. In response, astronomers are turning to deep learning techniques, but these methods are limited by their specific training sets, leading to considerable duplicate workloads. To overcome this issue, we built a framework for the general analysis of galaxy images based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratios of galaxy images and the imbalanced distribution of galaxy categories, we designed our LVM to incorporate a Human-in-the-loop (HITL) module, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability for all the abovementioned tasks on galaxy images in the DESI Legacy Imaging Surveys. In particular, for the object detection task, which was trained using 1,000 data points, our DST in the LVM achieved an accuracy of 96.7%, while ResNet50 plus Mask R-CNN reached an accuracy of 93.1%. For morphological classification, to obtain an area under the curve (AUC) of ~0.9, LVM plus DST and HITL required only 1/50 of the training data that ResNet18 required. In addition, multimodal data can be integrated, which creates possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-messenger astronomy.
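The human-in-the-loop idea above, routing only low-confidence predictions to a human expert, reduces to a simple triage loop. A generic sketch, in which the confidence threshold and the toy classifier are illustrative assumptions rather than the paper's HITL module:

```python
def hitl_filter(samples, predict, confidence_threshold=0.8):
    """Split predictions into auto-accepted and human-review queues."""
    accepted, review = [], []
    for s in samples:
        label, conf = predict(s)
        record = (s, label, conf)
        (accepted if conf >= confidence_threshold else review).append(record)
    return accepted, review

# Toy classifier: sign decides the label, magnitude stands in for confidence.
toy_predict = lambda x: ("galaxy" if x > 0 else "artifact", abs(x))

accepted, review = hitl_filter([0.95, -0.40, 0.70, 0.99], toy_predict)
print(len(accepted), len(review))  # 2 2
```

Labels confirmed or corrected on the review queue can then be fed back into training, which is what makes the loop sample-efficient on rare or low-signal-to-noise categories.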
Keywords: artificial intelligence, large vision model, human-in-the-loop, astronomy, galaxies
Remaining Useful Life Prediction of Gears Based on a Residual-Attention TCN and Vision Transformer
6
Authors: HU Aijun, LI Chenyang, XING Lei, ZHOU Zhuohao, XIANG Ling. Journal of Aerospace Power (PKU Core), 2025, No. 12, pp. 14-24 (11 pages)
The operating condition of a gear system is affected by multiple factors that exhibit long-term temporal dependencies and differ between local and global features. To effectively capture the temporal dependencies in the data and adaptively adjust attention to features, a temporal convolutional network with a residual convolutional block attention mechanism (RCMTCN) is proposed. By introducing residual connections into the convolutional block attention mechanism, the model attends to both the original input and the attention-weighted information, improving its perception of local information. On this basis, a vision transformer (ViT) model is combined with RCMTCN to predict the remaining useful life (RUL) of gears; the ViT model effectively captures global information in the data. Fusing the two exploits their respective strengths in local feature extraction and global attention for time-series data, improving perception of multi-dimensional features. Finally, the model was validated on gear degradation datasets under two operating conditions: pitting-fault data were used for training, and both pitting and tooth-breakage faults were tested. Experimental results show that, compared with other methods, the proposed approach extracts key feature information more fully, achieving scoring-function scores of 0.8898 on pitting faults and 0.8587 on tooth-breakage faults, demonstrating good adaptability across operating conditions and fault types.
Keywords: gears, remaining useful life, temporal networks, attention mechanism, vision transformer model
Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis
7
Authors: Jieyu An, Wan Mohd Nazmee Wan Zainon, Binfen Ding. Intelligent Automation & Soft Computing (SCIE), 2023, No. 8, pp. 1673-1689 (17 pages)
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
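Multimodal contrastive objectives of the kind described here typically take the InfoNCE form: matched image-text pairs are pulled together and mismatched pairs within the batch pushed apart. A self-contained NumPy sketch of that loss (a generic formulation, not necessarily the paper's exact variant):

```python
import numpy as np

def info_nce(img, txt, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.
    img, txt: (B, d), row i of img is the match for row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                   # (B, B) similarities
    diag = np.arange(len(img))
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_p_i2t[diag, diag].mean() + log_p_t2i[diag, diag].mean()) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
perfect = info_nce(emb, emb)         # identical pairs: diagonal dominates
shuffled = info_nce(emb, emb[::-1])  # mismatched pairs: diagonal is weak
print(perfect < shuffled)  # True
```

Minimizing this loss drives the cross-modal similarity matrix toward the identity, which is what gives the aligned joint embedding space the abstract relies on.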
Keywords: multimodal sentiment analysis, vision-language pre-trained model, contrastive learning, sentiment classification
Application and Evaluation of a Vision-LSTM Model in Ultrasound Diagnosis of TI-RADS Category 4b Thyroid Nodules
8
Authors: ZHANG Xinru, LI Yang, SUN Meng, NIE Wei, MA Zhe. Journal of Shandong University (Health Sciences) (PKU Core), 2025, No. 11, pp. 68-74 (7 pages)
Objective: To investigate the diagnostic accuracy of Vision-LSTM-based artificial intelligence (AI) for Thyroid Imaging Reporting and Data System category 4b (TI-RADS 4b) thyroid nodules on ultrasound, and to evaluate its feasibility for supporting clinical decision-making. Methods: Ultrasound imaging data of 401 TI-RADS 4b thyroid nodules from our hospital were collected and used to train and validate a Vision-LSTM model. The model's diagnoses were compared with those of junior and senior physicians in terms of accuracy and stability; model performance was quantified with metrics including the area under the curve (AUC) and precision-recall (PR) curves. Results: In independent validation, the Vision-LSTM model's AUC (0.88) and accuracy (89.4%) were both significantly higher than those of junior physicians (AUC: 0.624) and on par with senior physicians (AUC: 0.787), demonstrating its potential as a diagnostic aid. The AI model accurately identified complex features in ultrasound images and produced stable, consistent diagnoses with high accuracy and reliability. Conclusion: Vision-LSTM-based AI can significantly improve the efficiency and accuracy of diagnosing TI-RADS 4b thyroid nodules, providing effective assistance to physicians and reducing their workload.
Keywords: Thyroid Imaging Reporting and Data System, thyroid nodules, Vision-LSTM model, diagnostic accuracy, artificial intelligence
Modeling of a Linear Scanning 3D Vision Coordinate Measurement System
9
Authors: SUN Yuqin, HUANG Qingcheng, CHE Rensheng. Journal of Harbin Institute of Technology (New Series) (EI, CAS), 1998, No. 3, pp. 32-35 (4 pages)
This paper theoretically analyzes the coordinate frames of a 3D vision scanning system, establishes the mathematical model of the system's scanning process, and derives the relationship between the general non-orthonormal sensor coordinate system and the machine coordinate system, along with the coordinate transformation matrix for the system's extrinsic calibration.
Keywords: structured light, laser stripe sensor, 3D vision, CMM, mathematical model, extrinsic calibration
Structured scene modeling using micro stereo vision system with large field of view
10
Authors: YAN Shiying, ZHU Yuwen, LIU Jiayin, JIA Yunde. Journal of Harbin Institute of Technology (New Series) (EI, CAS), 2001, No. 3, pp. 296-299 (4 pages)
This paper presents a method for structured scene modeling using a micro stereo vision system with a large field of view. The proposed algorithm includes edge detection with a Canny detector, line fitting with a principal-axis-based approach, finding corresponding lines using a feature-based matching method, and 3D line depth computation.
Keywords: structured scene modeling, stereo vision, wide field of view, mobile robot
Special Topic on Security of Large Models
11
Authors: SU Zhou, DU Linkang. ZTE Communications, 2025, No. 3, pp. 1-2 (2 pages)
Large models, such as large language models (LLMs), vision-language models (VLMs), and multimodal agents, have become key elements in artificial intelligence (AI) systems. Their rapid development has greatly improved perception, generation, and decision-making in various fields. However, their vast scale and complexity bring new security challenges. Issues such as backdoor vulnerabilities during training, jailbreaking in multimodal reasoning, and data provenance and copyright auditing have made security a critical focus for both academia and industry.
Keywords: large models, security, multimodal agents, multimodal reasoning, large language models (LLMs), vision-language models, data provenance, copyright auditing, backdoor vulnerabilities
Foundation models: Insights and implications for gastrointestinal cancer
12
Authors: Lei Shi, Rui Huang, Li-Ling Zhao, An-Jie Guo. World Journal of Gastroenterology, 2025, No. 47, pp. 7-34 (28 pages)
Gastrointestinal (GI) cancers represent a major global health concern due to their high incidence and mortality rates. Foundation models (FMs), also referred to as large models, represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges. These models encompass large language models (LLMs), vision FMs (VFMs), and multimodal LLMs (MLLMs), all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization. This review delineates the principal applications of these models: LLMs facilitate the structuring of clinical narratives, extraction of insights from medical records, and enhancement of physician-patient communication; VFMs are employed in the analysis of endoscopic, radiological, and pathological images for lesion detection and staging; MLLMs integrate heterogeneous data modalities, including imaging, textual information, and genomic data, to support diagnostic processes, treatment prediction, and prognostic evaluation. Despite these promising developments, several challenges remain, such as the need for data standardization, limited diversity within training datasets, substantial computational resource requirements, and ethical-legal concerns. In conclusion, FMs exhibit significant potential to advance research and clinical management of GI cancers. Future research efforts should prioritize the refinement of these models, promote international collaborations, and adopt interdisciplinary approaches. Such a comprehensive strategy is essential to fully harness the capabilities of FMs, driving substantial progress in the fight against GI malignancies.
Keywords: foundation models, gastrointestinal cancers, large language models, vision foundation models, multimodal large language models
High-precision copper-grade identification via a vision transformer with PGNAA
13
Authors: Jie Cao, Chong-Gui Zhong, Han-Ting You, Yan Zhang, Ren-Bo Wang, Shu-Min Zhou, Jin-Hui Qu, Rui Chen, Shi-Liang Liu. Nuclear Science and Techniques, 2025, No. 7, pp. 89-99 (11 pages)
The identification of ore grades is a critical step in mineral resource exploration and mining. Prompt gamma neutron activation analysis (PGNAA) technology employs gamma rays generated by the nuclear reactions between neutrons and samples to achieve the qualitative and quantitative detection of sample components. In this study, we present a novel method for identifying copper grade by combining the vision transformer (ViT) model with the PGNAA technique. First, a Monte Carlo simulation is employed to determine the optimal sizes of the neutron moderator, thermal neutron absorption material, and dimensions of the device. Subsequently, based on the parameters obtained through optimization, a PGNAA copper ore measurement model is established. The gamma spectrum of the copper ore is analyzed using the ViT model, whose hyperparameters are optimized with a grid search. To ensure the reliability of the identification results, the test results are obtained through five repeated tenfold cross-validations. Long short-term memory and convolutional neural network models are compared with the ViT method. These results indicate that the ViT method is efficient in identifying copper ore grades, with average accuracy, precision, recall, F1 score, and F1(-) score values of 0.9795, 0.9637, 0.9614, 0.9625, and 0.9942, respectively. When identifying associated minerals, the ViT model can identify Pb, Zn, Fe, and Co minerals with identification accuracies of 0.9215, 0.9396, 0.9966, and 0.8311, respectively.
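The evaluation protocol mentioned here, five repeated tenfold cross-validations, is a standard scheme and easy to reproduce. A plain-Python sketch of the index bookkeeping, where the seeding and fold assignment are illustrative choices rather than the paper's:

```python
import random

def kfold_indices(n, k, seed):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_kfold(n, k=10, repeats=5):
    """Yield (train, test) index pairs for `repeats` independent k-fold rounds."""
    for r in range(repeats):
        for test in kfold_indices(n, k, seed=r):
            held_out = set(test)
            yield [i for i in range(n) if i not in held_out], test

splits = list(repeated_kfold(100, k=10, repeats=5))
print(len(splits))  # 50 train/test evaluations in total
```

Repeating the tenfold split with fresh shuffles averages out the variance that any single fold assignment introduces, which is why reported metrics are means over all 50 evaluations.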
Keywords: copper-grade identification, vision transformer model, prompt gamma neutron activation analysis, Monte Carlo N-Particle
Local Geomagnetic Component Modeling of Auroral Images Based on Local-Global Feature
14
Authors: WANG Bo, ZHANG Yuanshu, CHENG Wei, TIAN Xinqin, SHENG Qinghong, LI Jun, LING Xiao, LIU Xiang. Transactions of Nanjing University of Aeronautics and Astronautics, 2025, No. 6, pp. 710-727 (18 pages)
Accurately predicting the geomagnetic field is of great significance for space environment monitoring and space weather forecasting worldwide. This paper proposes a vision Transformer (ViT) hybrid model that leverages aurora images to predict the local geomagnetic station component, breaking the spatial limitations of geomagnetic stations. Our method utilizes the ViT backbone model in combination with convolutional networks to capture both the large-scale spatial correlation and distinct local feature correlation between aurora images and geomagnetic station data. Essentially, the model comprises a visual geometry group (VGG) image feature extraction network, a ViT-based encoder network, and a regression prediction network. Our experimental findings indicate that global features of aurora images play a more substantial role in predicting geomagnetic data than local features. Specifically, the hybrid model achieves a 39.1% reduction in root mean square error compared to the VGG model, a 29.5% reduction compared to the ViT model, and a 35.3% reduction relative to the residual network (ResNet) model. Moreover, the fitting accuracy of the model surpasses that of the VGG, ViT, and ResNet models by 2.14%, 1.58%, and 4.1%, respectively.
Keywords: ultraviolet aurora image, geomagnetic field prediction, vision Transformer (ViT) hybrid model
Structured-Text-Driven Few-Shot Inspection Method for Specialized Images
15
Authors: LIU Lei, YUAN Yonghong, HE Haipeng, FENG Hansen, WANG Zijun. Application Research of Computers (PKU Core), 2026, No. 2, pp. 385-392 (8 pages)
To address the problems in specialized image inspection of an excessive proportion of defect-free samples, scarce anomalous samples, and the limited performance of vision-language pre-trained models in vertical domains, a structured-text-driven inspection method for specialized images is proposed. First, the limited anomalous samples are expanded through jitter transformations and grid augmentation, and region-level-aligned structured text is combined with them to raise the "intelligence density" of the samples. Second, a bidirectional transformer representation model is adapted, introducing grid-image/structured-text contrastive learning together with a joint grid-semantic and spatial-consistency dual task to achieve cross-modal alignment of global and local features. Finally, the resulting model serves as the visual encoder of a large language model, supplying key inspection features for specialized image inspection. Few-shot experiments on the ABD-AD, MVTec-AD, and VisA datasets show that the proposed model improves over existing methods by 3.10% and 3.84% on localization and classification tasks, respectively, verifying the superior performance of structured text for few-shot anomaly detection in specialized image inspection scenarios.
Keywords: anomaly detection, structured text, vision-language models, few-shot learning, large language models, deep learning
A Survey of Data Collection and Processing for Embodied Intelligence
16
Authors: DING Guiguang, ZHU Chen, WANG Xiaowan, CHEN Hui. Journal of Data Acquisition and Processing (PKU Core), 2026, No. 2, pp. 332-346 (15 pages)
In recent years, vision-language-action (VLA) models have drawn wide attention in embodied intelligence. As model scale grows, generalization on complex tasks keeps improving, and these performance gains depend heavily on high-quality, large-scale training data. However, unlike natural language processing and computer vision, which can draw directly on massive internet data, embodied-intelligence data typically involve physical interaction between real robots and their environments, making collection costly and complex. How to efficiently acquire, process, and organize such data has become a key bottleneck for the field. This paper systematically reviews data collection and processing methods for embodied intelligence. First, it summarizes the mainstream data-acquisition paradigms by data source and collection method, analyzing their characteristics and limitations in data quality, scaling potential, and collection cost. Second, it summarizes standardized processing pipelines for embodied data, focusing on key technical steps such as action-representation alignment, multimodal temporal synchronization, standardization of linguistic semantics, and data quality control. Finally, it discusses trends in the embodied-data ecosystem, current difficulties, and possible future paths. This summary and analysis can support dataset construction and large-scale robot-learning research in embodied intelligence.
Keywords: embodied intelligence, vision-language-action models, robot learning, large-scale data collection, data processing
A Leather Defect Classification Method Based on Millimeter-Wave Sensing
17
Authors: ZHANG Jian, GUAN Haowen. Journal of Chinese Computer Systems (PKU Core), 2026, No. 2, pp. 257-264 (8 pages)
Leather defect classification is a key step in ensuring the quality of leather products. Traditional manual inspection and image-processing methods are limited by environmental factors such as lighting and cannot meet the demands of efficient inspection. In recent years, deep learning, particularly convolutional neural networks (CNNs), has improved the accuracy and efficiency of defect detection, but it remains susceptible to environmental conditions. Millimeter-wave radar, an emerging non-destructive testing technology, has attracted growing attention for its strong penetration and immunity to lighting and similar factors. This paper proposes a leather defect classification method combining millimeter-wave radar with an improved Vision Transformer model: time-frequency features of leather defects are extracted from millimeter-wave radar signals and classified with the deep learning model, achieving 95.62% accuracy on a self-built dataset, a significant advantage over classical classification models.
Keywords: millimeter-wave radar, leather defect classification, Vision Transformer model, transfer learning
A Review of Large Medical Imaging Models: Evolution, Technical Architectures, and Clinical Prospects
18
Authors: LI Lu, SUN Huaiqiang. Medical Journal of West China, 2026, No. 4, pp. 469-477 (9 pages)
In recent years, artificial intelligence in medical image analysis has been shifting from a "specialized model" to a "foundation model" paradigm. Traditional single-task models depend heavily on expert annotation and lack cross-task generalization, whereas large medical imaging models (LMIMs), self-supervised pre-trained on massive multimodal data, can adapt to many downstream tasks with only light fine-tuning, a key path toward general-purpose medical AI. This paper systematically reviews the latest research on large medical imaging models. First, existing models are grouped into three categories: vision foundation models, vision-language large models, and generalist and agent models. Second, core architectures (such as large-kernel convolutional neural networks, Vision Transformers, and their hybrids) and pre-training paradigms such as contrastive learning and masked modeling are analyzed in depth. Finally, the deployment challenges of data construction and cross-center generalization are examined, the clinical application potential in major diseases such as cancer is surveyed, and prospects for overcoming deployment bottlenecks with techniques such as causal reasoning and retrieval-augmented generation are discussed. Overall, large medical imaging models mark an important milestone in medical AI; they are expected to profoundly transform diagnostic workflows, improve the quality and efficiency of care, and ultimately benefit patients worldwide.
Keywords: large medical imaging models, foundation models, self-supervised learning, vision-language models, artificial general intelligence
Embodied Agents Based on Multimodal Large Models: Progress and Prospects
19
Authors: CAO Qun, GUO Jiexin, LEI Shengtao. Automation Panorama, 2026, No. 1, pp. 95-99 (5 pages)
Embodied agents, intelligent entities that perceive and act on physical space through instructions, are seen as a key path toward artificial general intelligence, with great potential in diverse scenarios such as medical assistance, intelligent education, and service robotics. Recently, leaps in multimodal large models have endowed embodied entities with strong semantic decoding, logical reasoning, and cross-modal perception, greatly accelerating the evolution of this paradigm. However, the booming research in this area urgently needs systematic review and in-depth analysis. This paper aims to give researchers a macro-level map of the field. It first surveys the multimodal foundation technologies underpinning embodied intelligence, then discusses three core dimensions: embodied large-model architectures, high-level strategic planning, and low-level fine-grained control. Finally, it offers views on the technical bottlenecks and limitations of existing research and an outlook on the future of embodied intelligence, providing a reference for continued innovation in the field.
Keywords: embodied agents, multimodal large models, robotic vision-language models, embodied intelligence
Embodied Radar: Concept, Architecture, and Development
20
Authors: XU Feng, LUO Meiyixiang, WEI Jiangtao, XU Jingwei, QIU Xiaolan, WU Junjie, WAN Xianrong, JIN Yaqiu. Aerospace Shanghai (Chinese & English), 2026, No. 1, pp. 13-30, 41 (19 pages)
To meet future needs such as detection and perception for autonomous intelligent unmanned systems, this paper sets out the concept of embodied radar: a platform-radar integrated autonomous sensing system that deeply couples radar perception with platform maneuvering and intelligent decision-making. Its core is to move beyond the traditional radar regime of fixed modes, one-way processing, and passive sensing, and to develop a closed-loop "perception-decision-action" processing paradigm in which the radar actively chooses its detection mode, maneuver path, and interaction strategy, thereby improving performance against dynamic targets, partially observable environments, and strongly contested electromagnetic scenarios. Traditional radars mostly follow task-specific designs, with fixed detection modes, non-adjustable parameters, and preset trajectories; their signal-processing chains are dominated by one-way, open-loop data flow, lacking the ability to self-tune after perceiving the environment and targets, and thus struggle to meet the needs of unmanned systems for real-time modeling and decision-making in complex environments. Embodied radar couples platform mobility, detection and perception, and agent planning strategies, builds an electromagnetic world model capturing the dynamic relations among the electromagnetic field, targets, environment, platform, and radar, and closes the loop in real time through an interactive information-processing framework, jointly optimizing detection and maneuvering strategies. Built on unmanned systems and breaking through to an embodied-intelligence sensing paradigm, embodied radar promises significant gains in detection effectiveness and autonomous operation in complex scenarios, with major implications for reshaping production patterns and future unmanned combat systems.
Keywords: embodied radar, microwave vision, physical intelligence, electromagnetic world model, interactive information processing