Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (No. LQ23F030001), the National Natural Science Foundation of China (No. 62406280), the Autism Research Special Fund of Zhejiang Foundation for Disabled Persons (No. 2023008), the Liaoning Province Higher Education Innovative Talents Program Support Project (No. LR2019058), the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases (No. 2021-KF-12-05), the Central Guidance on Local Science and Technology Development Fund of Liaoning Province (No. 2023JH6/100100066), and the Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, China; in part by the Open Research Fund of the State Key Laboratory of Cognitive Neuroscience and Learning.
Abstract: Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level approaches based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. In recent years, vision-language models (VLMs) have therefore been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of “Factor” to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
Funding: Supported by the Science and Technology Research Project of Jiangxi Education Department (Grant No. GJJ2203306).
Abstract: Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modalities, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation stems from their training on unimodal data and necessitates complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby capturing rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which improves performance on sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
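The abstract above does not give the exact formulation of its contrastive objective; as an illustration only, a minimal NumPy sketch of a symmetric InfoNCE-style image-text loss (the function name `info_nce` and the temperature value are assumptions, not the paper's implementation):

```python
import numpy as np

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pulls them together and pushes mismatched pairs apart.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(logits))               # positives on the diagonal

    def xent(l):
        # row-wise cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

For perfectly aligned pairs the loss approaches zero, while shuffled (mismatched) pairs yield a larger loss.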
Abstract: Large models, such as large language models (LLMs), vision-language models (VLMs), and multimodal agents, have become key elements in artificial intelligence (AI) systems. Their rapid development has greatly improved perception, generation, and decision-making in various fields. However, their vast scale and complexity bring new security challenges. Issues such as backdoor vulnerabilities during training, jailbreaking in multimodal reasoning, and data provenance and copyright auditing have made security a critical focus for both academia and industry.
Funding: Supported by the Open Project Program of Panxi Crops Research and Utilization Key Laboratory of Sichuan Province (No. SZKF202302) and the Fundamental Research Funds for the Central Universities (No. 2019CDYGYB024).
Abstract: Gastrointestinal (GI) cancers represent a major global health concern due to their high incidence and mortality rates. Foundation models (FMs), also referred to as large models, are a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges. These models encompass large language models (LLMs), vision FMs (VFMs), and multimodal LLMs (MLLMs), all of which use transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization. This review delineates the principal applications of these models: LLMs facilitate the structuring of clinical narratives, extraction of insights from medical records, and enhancement of physician-patient communication; VFMs are employed in the analysis of endoscopic, radiological, and pathological images for lesion detection and staging; MLLMs integrate heterogeneous data modalities, including imaging, textual information, and genomic data, to support diagnosis, treatment prediction, and prognostic evaluation. Despite these promising developments, several challenges remain, such as the need for data standardization, limited diversity within training datasets, substantial computational resource requirements, and ethical-legal concerns. In conclusion, FMs exhibit significant potential to advance research and clinical management of GI cancers. Future research should prioritize refining these models, promoting international collaboration, and adopting interdisciplinary approaches. Such a comprehensive strategy is essential to fully harness the capabilities of FMs and drive substantial progress in the fight against GI malignancies.
Abstract: To reduce the dependence of fruit object detection (e.g., for pomelo) on large amounts of annotated data, this paper proposes a pomelo fractal-tree image generation and augmentation method that incorporates a vision-language model. The method requires only 3-5 unannotated real images and can generate a large-scale annotated training dataset without any training. First, the text-prompt-based zero-shot segmentation model Grounded SAM (Grounded Segment Anything Model) is used to extract pomelo-tree components; then Stable Diffusion generates random backgrounds from text prompts; finally, an improved fractal-tree algorithm synthesizes pomelo trees to increase diversity and realism. Experiments used a lightweight YOLO v10 variant for validation. On a self-built pomelo detection dataset of unstructured environments, with 0, 8, 16, 32, and 64 real training images, the proposed method improved mAP50-95 (mean average precision at intersection-over-union thresholds from 0.50 to 0.95) by 662.3%, 24.9%, 13.7%, 8.8%, and 1.8%, respectively. With 221 real images and 512 generated images in the training set, the model achieved its best performance: precision 76.9%, recall 62.7%, mAP50 70.3%, and mAP50-95 38.4%. Transferred to an orange detection task, the improvements at the same data scales were 212.9%, 16.5%, 14.0%, 5.2%, and 4.1%; with 1302 real images and 512 generated images, the model likewise reached its best performance: precision 90.3%, recall 87.8%, mAP50 94.0%, and mAP50-95 54.0%. The results show that this image generation and augmentation method effectively expands training data in zero-shot and few-shot scenarios, improves the detection performance of the lightweight YOLO v10 model, and generalizes well.
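The paper's improved fractal-tree algorithm is not given above; as an illustration of the general idea, a minimal recursive sketch that generates randomized 2-D branch segments (the function name and the `spread`, `shrink`, and `jitter` parameters are assumptions for illustration, not the paper's algorithm):

```python
import math
import random

def fractal_tree(x, y, angle, length, depth, spread=25.0, shrink=0.72, jitter=8.0):
    """Recursively generate branch segments of a randomized 2-D fractal tree.

    Returns a list of ((x0, y0), (x1, y1)) line segments. The angular
    jitter gives each generated tree a different shape, which is the
    source of diversity in this kind of augmentation.
    """
    if depth == 0 or length < 1.0:
        return []
    # endpoint of the current branch
    x2 = x + length * math.cos(math.radians(angle))
    y2 = y + length * math.sin(math.radians(angle))
    segments = [((x, y), (x2, y2))]
    # two shorter child branches, fanned out and randomly perturbed
    for branch_angle in (angle - spread, angle + spread):
        a = branch_angle + random.uniform(-jitter, jitter)
        segments += fractal_tree(x2, y2, a, length * shrink, depth - 1,
                                 spread, shrink, jitter)
    return segments
```

A depth-`d` tree of this form yields 2^d - 1 segments, so depth 5 produces 31 branches per synthesized tree.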
Abstract: Over the past decade, large-scale pre-trained autoregressive and diffusion models have rejuvenated the field of text-guided image generation. However, these models require enormous datasets and parameter counts, and their multi-step generation processes are often inefficient and difficult to control. To address these challenges, we propose CAFE-GAN, a CLIP-projected GAN with attention-aware generation and multi-scale discrimination, which incorporates a pre-trained CLIP model along with several key architectural innovations. First, we embed a coordinate attention mechanism into the generator to capture long-range dependencies and enhance feature representation. Second, we introduce a trainable linear projection layer after the CLIP text encoder, which aligns textual embeddings with the generator's semantic space. Third, we design a multi-scale discriminator that leverages pre-trained visual features and integrates a feature regularization strategy, improving training stability and discrimination performance. Experiments on the CUB and COCO datasets demonstrate that CAFE-GAN outperforms existing text-to-image generation methods, achieving Fréchet Inception Distance (FID) scores of 9.84 and 5.62 on CUB and COCO, respectively, and generating images with superior visual quality and semantic fidelity, surpassing current state-of-the-art text-to-image models by varying degrees. These findings offer valuable insights for future research on efficient, controllable text-to-image synthesis.
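The trainable projection layer described above can be sketched in a few lines; the class name `TextProjection`, the dimensions, and the initialization scale below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class TextProjection:
    """Trainable linear map from a frozen text-encoder's embedding space
    to the generator's conditioning space (plain-NumPy sketch)."""

    def __init__(self, in_dim=512, out_dim=256):
        # W and b are the only trainable parameters; the text encoder
        # itself stays frozen during GAN training
        self.W = rng.normal(0.0, 0.02, (in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, text_emb):
        # text_emb: (batch, in_dim) -> (batch, out_dim)
        return text_emb @ self.W + self.b
```

The design point is that a single learned affine map is enough to bridge the frozen encoder's space and the generator's semantic space, keeping the number of new parameters small.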
Funding: Supported in part by the National Natural Science Foundation of China under Grant 52432012, and in part by the Shanghai Science and Technology Project under Grant 25ZR1402508.
Abstract: The use of unmanned aerial vehicles (UAVs) for defect detection on railway slopes is becoming increasingly widespread due to their ability to capture high-resolution images over large, inaccessible, and topographically complex areas. However, current UAV-based detection methods face several critical limitations, including constrained deployment frequency, limited availability of annotated defect data, and the lack of mature risk assessment frameworks. To address these challenges, this study introduces a novel approach that integrates diffusion models with large language models (LLMs) to generate high-quality synthetic defect images tailored to railway slope scenarios. Furthermore, an improved transformer-based architecture is proposed, incorporating attention mechanisms and LLM-guided diffusion-generated imagery to enhance defect recognition performance under complex environmental conditions. Experimental evaluations on a dataset of 300 field-collected images from high-risk railway slopes demonstrate that the proposed method significantly outperforms existing baselines in precision, recall, and robustness, indicating strong applicability for real-world railway infrastructure monitoring and disaster prevention.