Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (No. LQ23F030001); the National Natural Science Foundation of China (No. 62406280); the Autism Research Special Fund of Zhejiang Foundation for Disabled Persons (No. 2023008); the Liaoning Province Higher Education Innovative Talents Program Support Project (No. LR2019058); the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases (No. 2021-KF-12-05); the Central Guidance on Local Science and Technology Development Fund of Liaoning Province (No. 2023JH6/100100066); the Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, China; and in part by the Open Research Fund of the State Key Laboratory of Cognitive Neuroscience and Learning.
Abstract: Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level ones based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of “Factor” to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
Funding: Supported by the National Key R&D Program of China [2022YFF0902703] and the State Administration for Market Regulation Science and Technology Plan Project (2024MK033).
Abstract: Recommendation systems are key to boosting user engagement, satisfaction, and retention, particularly on media platforms where personalized content is vital. Sequential recommendation systems learn from user-item interactions to predict future items of interest. However, many current methods rely on unique user and item IDs, limiting their ability to represent users and items effectively, especially in zero-shot learning scenarios where training data is scarce. With the rapid development of Large Language Models (LLMs), researchers are exploring their potential to enhance recommendation systems. However, there is a semantic gap between the linguistic semantics of LLMs and the collaborative semantics of recommendation systems, where items are typically indexed by IDs. Moreover, most research focuses on item representations, neglecting personalized user modeling. To address these issues, we propose a sequential recommendation framework using LLMs, called CIT-Rec, a model that integrates Collaborative semantics for user representation and Image and Text information for item representation to enhance Recommendations. Specifically, by aligning intuitive image information with text containing semantic features, we can more accurately represent items, improving item representation quality. We focus not only on item representations but also on user representations. To more precisely capture users’ personalized preferences, we use traditional sequential recommendation models to train on users’ historical interaction data, effectively capturing behavioral patterns. Finally, by combining LLMs and traditional sequential recommendation models, we allow the LLM to understand linguistic semantics while capturing collaborative semantics. Extensive evaluations on real-world datasets show that our model outperforms baseline methods, effectively combining user interaction history with item visual and textual modalities to provide personalized recommendations.
Funding: Support from the National Natural Science Foundation of China (Grant Nos. 12173027, 12303105, 12173062); the National Key R&D Program of China (Grant Nos. 2023YFF0725300, 2022YFF0503402); the Science Research Grants from the Square Kilometre Array (SKA) (2020SKA0110100); the Science Research Grants from the China Manned Space Project (Grant Nos. CMS-CSST-2021-A01, CMS-CSST-2021-A07, CMS-CSST-2021-B05); the CAS Project for Young Scientists in Basic Research, China (Grant No. YSBR-062); the Young Data Scientist Project of the National Astronomical Data Center; and the Program of Science and Education Integration at the School of Astronomy and Space Science, University of Chinese Academy of Sciences, China.
Abstract: The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. In response, astronomers are turning to deep learning techniques, but these methods are limited by their specific training sets, leading to considerable duplicate workloads. To overcome this issue, we built a framework for the general analysis of galaxy images based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratios of galaxy images and the imbalanced distribution of galaxy categories, we designed our LVM to incorporate a Human-in-the-loop (HITL) module, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability for all the above-mentioned tasks on galaxy images in the DESI Legacy Imaging Surveys. In particular, for the object detection task, which was trained using 1000 data points, our DST in the LVM achieved an accuracy of 96.7%, while ResNet50 plus Mask R-CNN reached an accuracy of 93.1%. For morphological classification, to obtain an area under the curve (AUC) of ~0.9, LVM plus DST and HITL required only 1/50 of the training data that ResNet18 required. In addition, multimodal data can be integrated, which creates possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-messenger astronomy.
Funding: Supported by the Science and Technology Research Project of Jiangxi Education Department (Project Grant No. GJJ2203306).
Abstract: Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
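The abstract above does not spell out its contrastive objective. As a rough illustration only, the sketch below shows one common form of image-text contrastive learning, a symmetric InfoNCE-style loss over a batch of paired embeddings. The function names, the temperature value, and the toy embeddings are all assumptions made for this example, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; temperature=0.07 is an assumed default."""
    logits = [[cosine(img, txt) / temperature for txt in text_emb]
              for img in image_emb]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

# Toy embeddings (hypothetical): pair i of images matches pair i of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = image_text_contrastive_loss(images, matched_texts)
loss_mismatched = image_text_contrastive_loss(images, mismatched_texts)
```

In this toy setup the loss is small when each image embedding is closest to its own text and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.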
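Abstract: Objective: To investigate the ultrasound diagnostic accuracy of Vision-LSTM-based artificial intelligence (AI) for Thyroid Imaging Reporting and Data System category 4b (TI-RADS 4b) thyroid nodules, and to evaluate its feasibility for assisting clinical decision-making. Methods: Ultrasound image data from 401 TI-RADS 4b thyroid nodules at our hospital were collected and used to train and validate a Vision-LSTM model. The AI model's diagnoses were compared with those of junior and senior physicians to assess diagnostic accuracy and stability; model performance was quantified with metrics including the area under the curve (AUC) and precision-recall (PR) curves. Results: In independent validation, the Vision-LSTM model's AUC (0.88) and accuracy (89.4%) were both significantly higher than those of junior physicians (AUC: 0.624) and reached a level comparable to that of senior physicians (AUC: 0.787), demonstrating its potential for assisted diagnosis. The AI model accurately identified complex features in ultrasound images and produced stable, consistent diagnostic results, showing high accuracy and reliability. Conclusion: AI based on the Vision-LSTM model can significantly improve the efficiency and accuracy of diagnosing TI-RADS 4b thyroid nodules, providing effective assistance to physicians and reducing their workload.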
Abstract: This paper theoretically analyzes the coordinate frames of a 3D vision scanning system, establishes the mathematical model of the system's scanning process, and derives both the relationship between the general non-orthonormal sensor coordinate system and the machine coordinate system and the coordinate transformation matrix for the extrinsic calibration of the system.
Abstract: This paper presents a method for structured scene modeling using a micro stereo vision system with a large field of view. The proposed algorithm includes edge detection with the Canny detector, line fitting with a principal-axis-based approach, finding corresponding lines with a feature-based matching method, and 3D line depth computation.
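The abstract names principal-axis line fitting without giving details. A minimal sketch of the general idea, under the assumption that edge pixels arrive as 2D points, is to take the centroid of the points and the major axis of their scatter matrix as the line. The function name and sample points below are hypothetical and are not drawn from the paper.

```python
import math

def fit_line_principal_axis(points):
    """Fit a 2D line as (centroid, angle of the major principal axis).

    points: list of (x, y) edge-pixel coordinates (assumed input format).
    Returns the centroid and the axis angle in radians.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 scatter matrix about the centroid.
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Closed-form orientation of the major axis of a 2x2 symmetric matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), theta

# Hypothetical edge points lying on the diagonal y = x.
diagonal_center, diagonal_theta = fit_line_principal_axis(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
)
# Hypothetical edge points lying on the horizontal line y = 2.
flat_center, flat_theta = fit_line_principal_axis(
    [(0.0, 2.0), (1.0, 2.0), (2.0, 2.0)]
)
```

Unlike ordinary least squares on y, this fit minimizes perpendicular distance, so it also behaves sensibly for steep or vertical edges.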
Abstract: Large models, such as large language models (LLMs), vision-language models (VLMs), and multimodal agents, have become key elements in artificial intelligence (AI) systems. Their rapid development has greatly improved perception, generation, and decision-making in various fields. However, their vast scale and complexity bring about new security challenges. Issues such as backdoor vulnerabilities during training, jailbreaking in multimodal reasoning, and data provenance and copyright auditing have made security a critical focus for both academia and industry.
Funding: Supported by the Open Project Program of Panxi Crops Research and Utilization Key Laboratory of Sichuan Province, No. SZKF202302; and the Fundamental Research Funds for the Central Universities, No. 2019CDYGYB024.
Abstract: Gastrointestinal (GI) cancers represent a major global health concern due to their high incidence and mortality rates. Foundation models (FMs), also referred to as large models, represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges. These models encompass large language models (LLMs), vision FMs (VFMs), and multimodal LLMs (MLLMs), all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization. This review delineates the principal applications of these models: LLMs facilitate the structuring of clinical narratives, extraction of insights from medical records, and enhancement of physician-patient communication; VFMs are employed in the analysis of endoscopic, radiological, and pathological images for lesion detection and staging; MLLMs integrate heterogeneous data modalities, including imaging, textual information, and genomic data, to support diagnostic processes, treatment prediction, and prognostic evaluation. Despite these promising developments, several challenges remain, such as the need for data standardization, limited diversity within training datasets, substantial computational resource requirements, and ethical-legal concerns. In conclusion, FMs exhibit significant potential to advance research and clinical management of GI cancers. Future research efforts should prioritize the refinement of these models, promote international collaborations, and adopt interdisciplinary approaches. Such a comprehensive strategy is essential to fully harness the capabilities of FMs, driving substantial progress in the fight against GI malignancies.
Funding: Supported by the National Natural Science Foundation of China (Nos. U2BB2077 and 42374226), the Natural Science Foundation of Jiangxi Province (20232BAB201043 and 20232BCJ23006), and the Nuclear Energy Development Project of the National Defense Science and Industry Bureau (Nos. 20201192-01, 20201192-03).
Abstract: The identification of ore grades is a critical step in mineral resource exploration and mining. Prompt gamma neutron activation analysis (PGNAA) technology employs gamma rays generated by the nuclear reactions between neutrons and samples to achieve the qualitative and quantitative detection of sample components. In this study, we present a novel method for identifying copper grade by combining the vision transformer (ViT) model with the PGNAA technique. First, a Monte Carlo simulation is employed to determine the optimal sizes of the neutron moderator, thermal neutron absorption material, and dimensions of the device. Subsequently, based on the parameters obtained through optimization, a PGNAA copper ore measurement model is established. The gamma spectrum of the copper ore is analyzed using the ViT model. The ViT model is optimized for hyperparameters using a grid search. To ensure the reliability of the identification results, the test results are obtained through five repeated tenfold cross-validations. Long short-term memory and convolutional neural network models are compared with the ViT method. These results indicate that the ViT method is efficient in identifying copper ore grades, with average accuracy, precision, recall, F1 score, and F1(-) score values of 0.9795, 0.9637, 0.9614, 0.9625, and 0.9942, respectively. When identifying associated minerals, the ViT model can identify Pb, Zn, Fe, and Co minerals with identification accuracies of 0.9215, 0.9396, 0.9966, and 0.8311, respectively.
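The evaluation protocol mentioned above, five repeated tenfold cross-validations, can be sketched as an index generator: the data are reshuffled for each repeat and then divided into ten held-out folds, giving 50 train/test splits in total. This is a generic illustration; the function name, the seed, and the sample count of 100 are assumptions, and the paper's actual splitting procedure (for example, whether it stratifies by grade) is not stated.

```python
import random

def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        for fold in range(n_splits):
            # Every n_splits-th shuffled index forms the held-out fold.
            test_idx = indices[fold::n_splits]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx

# Hypothetical dataset of 100 spectra.
splits = list(repeated_kfold(n_samples=100))
```

Averaging a metric over all 50 splits reduces the dependence of the reported score on any single random partition.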
Funding: Supported by the National Natural Science Foundation of China (No. 41471381), the General Project of Jiangsu Natural Science Foundation (No. BK20171410), and the Major Scientific and Technological Achievements Cultivation Fund of Nanjing University of Aeronautics and Astronautics (No. 1011-XBD23002).
Abstract: Accurately predicting the geomagnetic field is of great significance for space environment monitoring and space weather forecasting worldwide. This paper proposes a vision Transformer (ViT) hybrid model that leverages aurora images to predict local geomagnetic station components, breaking the spatial limitations of geomagnetic stations. Our method utilizes the ViT backbone model in combination with convolutional networks to capture both the large-scale spatial correlation and distinct local feature correlation between aurora images and geomagnetic station data. Essentially, the model comprises a visual geometry group (VGG) image feature extraction network, a ViT-based encoder network, and a regression prediction network. Our experimental findings indicate that global features of aurora images play a more substantial role in predicting geomagnetic data than local features. Specifically, the hybrid model achieves a 39.1% reduction in root mean square error compared to the VGG model, a 29.5% reduction compared to the ViT model, and a 35.3% reduction relative to the residual network (ResNet) model. Moreover, the fitting accuracy of the model surpasses that of the VGG, ViT, and ResNet models by 2.14%, 1.58%, and 4.1%, respectively.
Funding: Supported by the National Natural Science Foundation of China (No. 61125101).
Abstract: To address the low positioning accuracy and execution efficiency of robot binocular vision, a binocular vision positioning method based on coarse-fine stereo matching is proposed to achieve object positioning. In the coarse matching, random ferns are used to identify objects in the left and right images, and the pixel coordinates of the object center points in the two images are calculated to complete the center matching. In the fine matching, the right center point is treated as an estimate that sets the search range in the right image, within which region matching is performed to find the best match for the left center point. Then, the similar-triangle principle of the binocular vision model is used to calculate the 3D coordinates of the center point, achieving fast and accurate object positioning. Finally, the proposed method is applied to object scene images and a robotic arm grasping platform. The experimental results show that the average absolute positioning error and average relative positioning error of the proposed method are 8.22 mm and 1.96%, respectively, when the object's depth distance is within 600 mm, and the time consumption is less than 1.029 s. The method can meet the needs of the robot grasping system, and has better accuracy and robustness.
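The similar-triangle step mentioned above can be illustrated with the textbook relation for a rectified, parallel stereo pair: depth equals focal length times baseline divided by disparity. The sketch below is a generic version under that assumption. The function name, the camera parameters (focal length in pixels, baseline in millimetres, principal point), and the pixel coordinates are hypothetical values chosen for the example and are not taken from the paper.

```python
def triangulate_center(u_left, v_left, u_right, focal_px, baseline_mm, cx, cy):
    """Recover (X, Y, Z) in mm for a matched center point.

    Assumes rectified cameras with parallel optical axes, so the match
    differs only in its horizontal pixel coordinate.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_mm / disparity   # similar triangles: Z = f * B / d
    x = (u_left - cx) * z / focal_px         # back-project through the left camera
    y = (v_left - cy) * z / focal_px
    return x, y, z

# Hypothetical calibration and matched center points.
point = triangulate_center(
    u_left=400.0, v_left=260.0, u_right=320.0,
    focal_px=800.0, baseline_mm=60.0, cx=320.0, cy=240.0,
)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters more for distant objects, which is one reason a refined match of the center point can improve positioning accuracy.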
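Abstract: To reduce the dependence of object detection for pomelo and other fruits on large amounts of annotated data, this paper proposes a fractal-tree image generation and augmentation method for pomelo that incorporates vision-language models. The method needs only 3 to 5 unannotated real images to generate a large-scale annotated training dataset without any training. First, a text-prompted zero-shot segmentation model (Grounded segment anything model, Grounded SAM) extracts pomelo tree components; then the Stable Diffusion model generates random backgrounds from text prompts; finally, an improved fractal tree algorithm generates pomelo trees to increase diversity and realism. The lightweight version of YOLO v10 was used for validation. On a self-built pomelo detection dataset collected in unstructured environments, with 0, 8, 16, 32, and 64 real images in the training set, the proposed method improved the model's mean average precision at intersection over union thresholds from 0.50 to 0.95 (mAP50-95) by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8%, respectively. With 221 real images and 512 generated images in the training set, the model reached its best performance: precision of 76.9%, recall of 62.7%, mAP50 of 70.3%, and mAP50-95 of 38.4%. When transferred to an orange detection task, the improvements at the same data scales were 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%, respectively. With 1302 real images and 512 generated images in the training set, the model likewise reached its best performance: precision of 90.3%, recall of 87.8%, mAP50 of 94.0%, and mAP50-95 of 54.0%. The results show that the image generation and augmentation method effectively expands training data in zero-shot and few-shot learning scenarios, improves the detection performance of the lightweight YOLO v10, and generalizes well.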