In recent years,large vision-language models(VLMs)have achieved significant breakthroughs in cross-modal understanding and generation.However,the safety issues arising from their multimodal interactions become promine...In recent years,large vision-language models(VLMs)have achieved significant breakthroughs in cross-modal understanding and generation.However,the safety issues arising from their multimodal interactions become prominent.VLMs are vulnerable to jailbreak attacks,where attackers craft carefully designed prompts to bypass safety mechanisms,leading them to generate harmful content.To address this,we investigate the alignment between visual inputs and task execution,uncovering locality defects and attention biases in VLMs.Based on these findings,we propose VOTI,a novel jailbreak framework leveraging visual obfuscation and task induction.VOTI subtly embeds malicious keywords within neutral image layouts to evade detection,and breaks down harmful queries into a sequence of subtasks.This approach disperses malicious intent across modalities,exploiting VLMs’over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms.Implemented as an automated framework,VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies.Extensive experiments across seven mainstream VLMs demonstrate VOTI’s effectiveness,achieving a 73.46%attack success rate on GPT-4o-mini.These results reveal critical vulnerabilities in VLMs,highlighting the urgent need for improving robust defenses and multimodal alignment.展开更多
Video action recognition(VAR)aims to analyze dynamic behaviors in videos and achieve semantic understanding.VAR faces challenges such as temporal dynamics,action-scene coupling,and the complexity of human interactions...Video action recognition(VAR)aims to analyze dynamic behaviors in videos and achieve semantic understanding.VAR faces challenges such as temporal dynamics,action-scene coupling,and the complexity of human interactions.Existing methods can be categorized into motion-level,event-level,and story-level ones based on spatiotemporal granularity.However,single-modal approaches struggle to capture complex behavioral semantics and human factors.Therefore,in recent years,vision-language models(VLMs)have been introduced into this field,providing new research perspectives for VAR.In this paper,we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field.Additionally,we propose the concept of“Factor”to identify and integrate key information from both visual and textual modalities,enhancing multimodal alignment.We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.展开更多
human-robot collaboration(HRC)is set to transform the manufacturing paradigm by leveraging the strengths of human flexibility and robot precision.The recent breakthrough of Large Language Models(LLMs)and Vision-Langua...human-robot collaboration(HRC)is set to transform the manufacturing paradigm by leveraging the strengths of human flexibility and robot precision.The recent breakthrough of Large Language Models(LLMs)and Vision-Language Models(VLMs)has motivated the preliminary explorations and adoptions of these models in the smart manufacturing field.However,despite the considerable amount of effort,existing research mainly focused on individual components without a comprehensive perspective to address the full potential of VLMs,especially for HRC in smart manufacturing scenarios.To fill the gap,this work offers a systematic review of the latest advance-ments and applications of VLMs in HRC for smart manu-facturing,which covers the fundamental architectures and pretraining methodologies of LLMs and VLMs,their applications in robotic task planning,navigation,and manipulation,and role in enhancing human-robot skill transfer through multimodal data integration.Lastly,the paper discusses current limitations and future research directions in VLM-based HRC,highlighting the trend in fully realizing the potential of these technologies for smart manufacturing.展开更多
The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can si...The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can simultaneously process multi-modality data such as medical images and medical reports.These models can not only recognize images,but also understand the semantic relationship between images and texts,effectively realize the integration of medical information,and provide strong support for clinical decision-making and disease diagnosis.The visual-language large model has good performance for specific medical tasks,and also shows strong potential and high intelligence in the general task models.This paper provides a comprehensive review of the visual-language large model in the field of medical health.Specifically,this paper first introduces the basic theoretical basis and technical principles.Then,this paper introduces the specific application scenarios in the field of medical health,including modality fusion,semi-supervised learning,weakly supervised learning,unsupervised learning,cross-domain model and general models.Finally,the challenges including insufficient data,interpretability,and practical deployment are discussed.According to the existing challenges,four potential future development directions are given.展开更多
In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural lang...In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability for complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs’ strengths, limitations, and applicability across tasks while examining challenges such as scalability, data quality, and fine-tuning complexities. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.展开更多
The advent of large vision-language models(LVLMs)represents a remarkable advance in the quest for artificial general intelligence.However,the models’effectiveness in both specialized and general tasks warrants furthe...The advent of large vision-language models(LVLMs)represents a remarkable advance in the quest for artificial general intelligence.However,the models’effectiveness in both specialized and general tasks warrants further investigation.This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks,respectively,aiming to offer a comprehensive understanding of these novel models.To gauge their effectiveness in specialized tasks,we employ six challenging tasks in three different application scenarios:natural,healthcare,and industrial.These six tasks include salient/camouflaged/transparent object detection,as well as polyp detection,skin lesion detection,and industrial anomaly detection.We examine the performance of three recent open-source LVLMs,including MiniGPT-v2,LLaVA-1.5,and Shikra,on both visual recognition and localization in these tasks.Moreover,we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V,assessing their multi-modal understanding capabilities in general tasks including object counting,absurd question answering,affordance reasoning,attribute recognition,and spatial relation reasoning.Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks.We delve deep into this inadequacy and uncover several potential factors,including limited cognition in specialized tasks,object hallucination,text-to-image interference,and decreased robustness in complex problems.We hope that this study can provide useful insights for the future development of LVLMs,helping researchers improve LVLMs for both general and specialized applications.展开更多
Large language models(LLMs),such as ChatGPT,have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains.Recently,large vision-langua...Large language models(LLMs),such as ChatGPT,have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains.Recently,large vision-language models(VLMs)that learn rich vision–language correlation from image–text pairs,like BLIP-2 and GPT-4,have been intensively investigated.However,despite these developments,the application of LLMs and VLMs in image quality assessment(IQA),particularly in medical imaging,remains unexplored.This is valuable for objective performance evaluation and potential supplement or even replacement of radiologists’opinions.To this end,this study intro-duces IQAGPT,an innovative computed tomography(CT)IQA system that integrates image-quality captioning VLM with ChatGPT to generate quality scores and textual reports.First,a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation.To better leverage the capabilities of LLMs,the annotated quality scores are converted into semantically rich text descriptions using a prompt template.Second,the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate qual-ity descriptions.The captioning model fuses image and text features through cross-modal attention.Third,based on the quality descriptions,users verbally request ChatGPT to rate image-quality scores or produce radiological qual-ity reports.Results demonstrate the feasibility of assessing image quality using LLMs.The proposed IQAGPT outper-formed GPT-4 and CLIP-IQA,as well as multitask classification and regression models that solely rely on images.展开更多
In the field of satellite imagery, remote sensing image captioning(RSIC) is a hot topic with the challenge of overfitting and difficulty of image and text alignment. To address these issues, this paper proposes a visi...In the field of satellite imagery, remote sensing image captioning(RSIC) is a hot topic with the challenge of overfitting and difficulty of image and text alignment. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset DIOR-Captions is built for augmenting object detection in optical remote(DIOR) sensing images dataset with manually annotated Chinese and English contents. Second, a Vision-Language aligning model with Cross-modal Attention(VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a crossmodal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-toend Chinese captions generation by using the pre-training language model of Chinese. The experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm is more descriptive and informative than existing algorithms in producing captions.展开更多
Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLM...Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLMs.Therefore,in order to better assess the capability of LLMs in the agricultural domain,Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture.The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain:crop science,horticulture,plant protection,animal husbandry,forest science,aquaculture science,and grass science,and contained a total of 2283 questions.Among domestic general-purpose LLMs,DeepSeek R1 performed best with an accuracy rate of 75.49%.In the realm of international general-purpose LLMs,Gemini 2.0 pro exp 0205 standed out as the top performer,achieving an accuracy rate of 74.28%.As an LLMs in agriculture vertical,Shennong V2.0 outperformed all the LLMs in China,and the answer accuracy rate of agricultural knowledge exceeded that of all the existing general-purpose LLMs.The launch of Agri-Eval helped the LLM developers to comprehensively evaluate the model's capability in the field of agriculture through a variety of tasks and tests to promote the development of the LLMs in the field of agriculture.展开更多
We present a novel framework,CLIPSP,and a novel adaptive prompt method to leverage pre-trained knowledge from CLIP for scene parsing.Our approach addresses the limitations of DenseCLIP,which demonstrates the superior ...We present a novel framework,CLIPSP,and a novel adaptive prompt method to leverage pre-trained knowledge from CLIP for scene parsing.Our approach addresses the limitations of DenseCLIP,which demonstrates the superior image segmentation provided by CLIP pre-trained models over ImageNet pre-trained models,but struggles with rough pixel-text score maps for complex scene parsing.We argue that,as they contain all textual information in a dataset,the pixel-text score maps,i.e.,dense prompts,are inevitably mixed with noise.To overcome this challenge,we propose a two-step method.Firstly,we extract visual and language features and perform multi-label classification to identify the most likely categories in the input images.Secondly,based on the top-k categories and confidence scores,our method generates scene tokens which can be treated as adaptive prompts for implicit modeling of scenes,and incorporates them into the visual features fed into the decoder for segmentation.Our method imposes a constraint on prompts and suppresses the probability of irrelevant categories appearing in the scene parsing results.Our method achieves competitive performance,limited by the available visual-language pre-trained models.Our CLIP-SP performs 1.14%better(in terms of mIoU)than DenseCLIP on ADE20K,using a ResNet-50 backbone.展开更多
In this paper,we establish and study a single-species logistic model with impulsive age-selective harvesting.First,we prove the ultimate boundedness of the solutions of the system.Then,we obtain conditions for the asy...In this paper,we establish and study a single-species logistic model with impulsive age-selective harvesting.First,we prove the ultimate boundedness of the solutions of the system.Then,we obtain conditions for the asymptotic stability of the trivial solution and the positive periodic solution.Finally,numerical simulations are presented to validate our results.Our results show that age-selective harvesting is more conducive to sustainable population survival than non-age-selective harvesting.展开更多
In recent years,there has been an increasing need for climate information across diverse sectors of society.This demand has arisen from the necessity to adapt to and mitigate the impacts of climate variability and cha...In recent years,there has been an increasing need for climate information across diverse sectors of society.This demand has arisen from the necessity to adapt to and mitigate the impacts of climate variability and change.Likewise,this period has seen a significant increase in our understanding of the physical processes and mechanisms that drive precipitation and its variability across different regions of Africa.By leveraging a large volume of climate model outputs,numerous studies have investigated the model representation of African precipitation as well as underlying physical processes.These studies have assessed whether the physical processes are well depicted and whether the models are fit for informing mitigation and adaptation strategies.This paper provides a review of the progress in precipitation simulation overAfrica in state-of-the-science climate models and discusses the major issues and challenges that remain.展开更多
Utilizing finite element analysis,the ballistic protection provided by a combination of perforated D-shaped and base armor plates,collectively referred to as radiator armor,is evaluated.ANSYS Explicit Dynamics is empl...Utilizing finite element analysis,the ballistic protection provided by a combination of perforated D-shaped and base armor plates,collectively referred to as radiator armor,is evaluated.ANSYS Explicit Dynamics is employed to simulate the ballistic impact of 7.62 mm armor-piercing projectiles on Aluminum AA5083-H116 and Steel Secure 500 armors,focusing on the evaluation of material deformation and penetration resistance at varying impact points.While the D-shaped armor plate is penetrated by the armor-piercing projectiles,the combination of the perforated D-shaped and base armor plates successfully halts penetration.A numerical model based on the finite element method is developed using software such as SolidWorks and ANSYS to analyze the interaction between radiator armor and bullet.The perforated design of radiator armor is to maintain airflow for radiator function,with hole sizes smaller than the bullet core diameter to protect radiator assemblies.Predictions are made regarding the brittle fracture resulting from the projectile core′s bending due to asymmetric impact,and the resulting fragments failed to penetrate the perforated base armor plate.Craters are formed on the surface of the perforated D-shaped armor plate due to the impact of projectile fragments.The numerical model accurately predicts hole growth and projectile penetration upon impact with the armor,demonstrating effective protection of the radiator assemblies by the radiator armor.展开更多
The National Geophysical Data Center(NGDC)of the United States has collected aeromagnetic data for input into a series of geomagnetic models to improve model resolution;however,in the Tibetan Plateau region,ground-bas...The National Geophysical Data Center(NGDC)of the United States has collected aeromagnetic data for input into a series of geomagnetic models to improve model resolution;however,in the Tibetan Plateau region,ground-based observations remain insufficient to clearly reflect the characteristics of the region’s lithospheric magnetism.In this study,we evaluate the lithospheric magnetism of the Tibetan Plateau by using a 3D surface spline model based on observations from>200 newly constructed repeat stations(portable stations)to determine the spatial distribution of plateau geomagnetism,as well as its correlation with the tectonic features of the region.We analyze the relationships between M≥5 earthquakes and lithospheric magnetic field variations on the Tibetan Plateau and identify regions susceptible to strong earthquakes.We compare the geomagnetic results with those from an enhanced magnetic model(EMM2015)developed by the NGDC and provide insights into improving lithospheric magnetic field calculations in the Tibetan Plateau region.Further research reveals that these magnetic anomalies exhibit distinct differences from the magnetic-seismic correlation mechanisms observed in other tectonic settings;here,they are governed primarily by the combined effects of compressional magnetism,thermal magnetism,and deep thermal stress.This study provides new evidence of geomagnetic anomalies on the Tibetan Plateau,interprets them physically,and demonstrates their potential for identifying seismic hazard zones on the Plateau.展开更多
Climate model prediction has been improved by enhancing model resolution as well as the implementation of sophisticated physical parameterization and refinement of data assimilation systems[section 6.1 in Wang et al.(...Climate model prediction has been improved by enhancing model resolution as well as the implementation of sophisticated physical parameterization and refinement of data assimilation systems[section 6.1 in Wang et al.(2025)].In relation to seasonal forecasting and climate projection in the East Asian summer monsoon season,proper simulation of the seasonal migration of rain bands by models is a challenging and limiting factor[section 7.1 in Wang et al.(2025)].展开更多
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on...Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities.This limitation is attributed to their training on unimodal data,and necessitates the use of complex fusion mechanisms for sentiment analysis.In this study,we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method.Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework.We employ a Transformer architecture to integrate these representations,thereby enabling the capture of rich semantic infor-mation in image-text pairs.To further enhance the representation learning of these pairs,we introduce our proposed multimodal contrastive learning method,which leads to improved performance in sentiment analysis tasks.Our approach is evaluated through extensive experiments on two publicly accessible datasets,where we demonstrate its effectiveness.We achieve a significant improvement in sentiment analysis accuracy,indicating the supe-riority of our approach over existing techniques.These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.展开更多
To investigate the influence of coarse aggregate parent rock properties on the elastic modulus of concrete,the mineralogical properties and stress-strain curves of granite and dolomite parent rocks,as well as the stre...To investigate the influence of coarse aggregate parent rock properties on the elastic modulus of concrete,the mineralogical properties and stress-strain curves of granite and dolomite parent rocks,as well as the strength and elastic modulus of mortar and concrete prepared with mechanism aggregates of the corresponding lithology,and the stress-strain curves of concrete were investigated.In this paper,a coarse aggregate and mortar matrix bonding assumption is proposed,and a prediction model for the elastic modulus of mortar is established by considering the lithology of the mechanism sand and the slurry components.An equivalent coarse aggregate elastic modulus model was established by considering factors such as coarse aggregate particle size,volume fraction,and mortar thickness between coarse aggregates.Based on the elastic modulus of the equivalent coarse aggregate and the remaining mortar,a prediction model for the elastic modulus of the two and three components of concrete in series and then in parallel was established,and the predicted values differed from the measured values within 10%.It is proposed that the coarse aggregate elastic modulus in highstrength concrete is the most critical factor affecting the elastic modulus of concrete,and as the coarse aggregate elastic modulus increases by 27.7%,the concrete elastic modulus increases by 19.5%.展开更多
Kinetic impact is the most practical planetary-defense technique,with momentum-transfer efficiency central to deflection design.We present a Monte Carlo photometric framework that couples ejecta sampling,dynamical evo...Kinetic impact is the most practical planetary-defense technique,with momentum-transfer efficiency central to deflection design.We present a Monte Carlo photometric framework that couples ejecta sampling,dynamical evolution,and image synthesis to compare directly with HST,LICIACube,ground-based and Lucy observations of the DART impact.Decomposing ejecta into(1)a highvelocity(~1600 m/s)plume exhibiting Na/K resonance,(2)a low-velocity(~1 m/s)conical component shaped by binary gravity and solar radiation pressure,and(3)meter-scale boulders,we quantify each component’s mass and momentum.Fitting photometric decay curves and morphological evolution yields size-velocity distributions and,via scaling laws,estimates of Dimorphos’bulk density,cratering parameters,and cohesive strength that agree with dynamical constraints.Photometric ejecta modeling therefore provides a robust route to constrain momentum enhancement and target properties,improving predictive capability for kinetic-deflection missions.展开更多
Customer churn is the rate at which customers discontinue doing business with a company over a given time period.It is an essential measure for businesses to monitor high churn rates,as they often indicate underlying ...Customer churn is the rate at which customers discontinue doing business with a company over a given time period.It is an essential measure for businesses to monitor high churn rates,as they often indicate underlying issues with services,products,or customer experience,resulting in considerable income loss.Prediction of customer churn is a crucial task aimed at retaining customers and maintaining revenue growth.Traditional machine learning(ML)models often struggle to capture complex temporal dependencies in client behavior data.To address this,an optimized deep learning(DL)approach using a Regularized Bidirectional Long Short-Term Memory(RBiLSTM)model is proposed to mitigate overfitting and improve generalization error.The model integrates dropout,L2-regularization,and early stopping to enhance predictive accuracy while preventing over-reliance on specific patterns.Moreover,this study investigates the effect of optimization techniques on boosting the training efficiency of the developed model.Experimental results on a recent public customer churn dataset demonstrate that the trained model outperforms the traditional ML models and some other DL models,such as Long Short-Term Memory(LSTM)and Deep Neural Network(DNN),in churn prediction performance and stability.The proposed approach achieves 96.1%accuracy,compared with LSTM and DNN,which attain 94.5%and 94.1%accuracy,respectively.These results confirm that the proposed approach can be used as a valuable tool for businesses to identify at-risk consumers proactively and implement targeted retention strategies.展开更多
This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to use...This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to user behavior and platform-driven moderation on social media.The proposed methodological framework(1)utilizes large language models for social media post analysis and categorization,(2)employs k-means clustering for content characterization,and(3)incorporates the TODIM(Tomada de Decisão Interativa Multicritério)method to determine moderation strategies based on expert judgments.In general,the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems.When applied in social media moderation,this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location.The application of this framework is demonstrated within Facebook groups.Eight distinct content clusters encompassing safety,harassment,diversity,and misinformation are identified.Analysis revealed a preference for content removal across all clusters,suggesting a cautious approach towards potentially harmful content.However,the framework also highlights the use of other moderation actions,like account suspension,depending on the content category.These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.展开更多
文摘In recent years,large vision-language models(VLMs)have achieved significant breakthroughs in cross-modal understanding and generation.However,the safety issues arising from their multimodal interactions become prominent.VLMs are vulnerable to jailbreak attacks,where attackers craft carefully designed prompts to bypass safety mechanisms,leading them to generate harmful content.To address this,we investigate the alignment between visual inputs and task execution,uncovering locality defects and attention biases in VLMs.Based on these findings,we propose VOTI,a novel jailbreak framework leveraging visual obfuscation and task induction.VOTI subtly embeds malicious keywords within neutral image layouts to evade detection,and breaks down harmful queries into a sequence of subtasks.This approach disperses malicious intent across modalities,exploiting VLMs’over-reliance on local visual cues and their fragility in multi-step reasoning to bypass global safety mechanisms.Implemented as an automated framework,VOTI integrates large language models as red-team assistants to generate and iteratively optimize jailbreak strategies.Extensive experiments across seven mainstream VLMs demonstrate VOTI’s effectiveness,achieving a 73.46%attack success rate on GPT-4o-mini.These results reveal critical vulnerabilities in VLMs,highlighting the urgent need for improving robust defenses and multimodal alignment.
基金supported by the Zhejiang Provincial Natural Science Foundation of China(No.LQ23F030001)the National Natural Science Foundation of China(No.62406280)+5 种基金the Autism Research Special Fund of Zhejiang Foundation for Disabled Persons(No.2023008)the Liaoning Province Higher Education Innovative Talents Program Support Project(No.LR2019058)the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases(No.2021-KF-12-05)the Central Guidance on Local Science and Technology Development Fund of Liaoning Province(No.2023JH6/100100066)the Key Laboratory for Biomedical Engineering of Ministry of Education,Zhejiang University,Chinain part by the Open Research Fund of the State Key Laboratory of Cognitive Neuroscience and Learning.
文摘Video action recognition(VAR)aims to analyze dynamic behaviors in videos and achieve semantic understanding.VAR faces challenges such as temporal dynamics,action-scene coupling,and the complexity of human interactions.Existing methods can be categorized into motion-level,event-level,and story-level ones based on spatiotemporal granularity.However,single-modal approaches struggle to capture complex behavioral semantics and human factors.Therefore,in recent years,vision-language models(VLMs)have been introduced into this field,providing new research perspectives for VAR.In this paper,we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field.Additionally,we propose the concept of“Factor”to identify and integrate key information from both visual and textual modalities,enhancing multimodal alignment.We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
基金Research Institute for Advanced Manufacturing(RIAM)of The Hong Kong Polytechnic University(1-CDJT)Intra-Faculty Interdisciplinary Project 2023/24(1-WZ4N)+6 种基金Research Committee of The Hong Kong Polytechnic UniversityState Key Laboratory of Intelligent Manufacturing Equipment and Technology,Huazhong University of Science and Technology(IMETKF2024010)Guangdong-Hong Kong Technology Cooperation Funding Scheme(GHX/075/22GD)Innovation and Technology Commission(ITC)COMAC International Collaborative Research Project(COMAC-SFGS-2023-3148)General Research Fund from the Research Grants Council of the Hong Kong Special Administrative Region,China(Project Nos.PolyU15210222 and PolyU15206723)Open access funding provided by the Hong Kong Polytechnic University.
文摘human-robot collaboration(HRC)is set to transform the manufacturing paradigm by leveraging the strengths of human flexibility and robot precision.The recent breakthrough of Large Language Models(LLMs)and Vision-Language Models(VLMs)has motivated the preliminary explorations and adoptions of these models in the smart manufacturing field.However,despite the considerable amount of effort,existing research mainly focused on individual components without a comprehensive perspective to address the full potential of VLMs,especially for HRC in smart manufacturing scenarios.To fill the gap,this work offers a systematic review of the latest advance-ments and applications of VLMs in HRC for smart manu-facturing,which covers the fundamental architectures and pretraining methodologies of LLMs and VLMs,their applications in robotic task planning,navigation,and manipulation,and role in enhancing human-robot skill transfer through multimodal data integration.Lastly,the paper discusses current limitations and future research directions in VLM-based HRC,highlighting the trend in fully realizing the potential of these technologies for smart manufacturing.
基金The Natural Science Foundation of Hebei Province(F2024501044).
文摘The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can simultaneously process multi-modality data such as medical images and medical reports.These models can not only recognize images,but also understand the semantic relationship between images and texts,effectively realize the integration of medical information,and provide strong support for clinical decision-making and disease diagnosis.The visual-language large model has good performance for specific medical tasks,and also shows strong potential and high intelligence in the general task models.This paper provides a comprehensive review of the visual-language large model in the field of medical health.Specifically,this paper first introduces the basic theoretical basis and technical principles.Then,this paper introduces the specific application scenarios in the field of medical health,including modality fusion,semi-supervised learning,weakly supervised learning,unsupervised learning,cross-domain model and general models.Finally,the challenges including insufficient data,interpretability,and practical deployment are discussed.According to the existing challenges,four potential future development directions are given.
文摘In multimodal learning, Vision-Language Models (VLMs) have become a critical research focus, enabling the integration of textual and visual data. These models have shown significant promise across various natural language processing tasks, such as visual question answering and computer vision applications, including image captioning and image-text retrieval, highlighting their adaptability for complex, multimodal datasets. In this work, we review the landscape of Bootstrapping Language-Image Pre-training (BLIP) and other VLM techniques. A comparative analysis is conducted to assess VLMs’ strengths, limitations, and applicability across tasks while examining challenges such as scalability, data quality, and fine-tuning complexities. The work concludes by outlining potential future directions in VLM research, focusing on enhancing model interpretability, addressing ethical implications, and advancing multimodal integration in real-world applications.
基金supported by the National Natural Science Foundation of China(No.62176169)the Fundamental Research Funds for the Central Universities(Nankai University,070-63243150).
文摘The advent of large vision-language models(LVLMs)represents a remarkable advance in the quest for artificial general intelligence.However,the models’effectiveness in both specialized and general tasks warrants further investigation.This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks,respectively,aiming to offer a comprehensive understanding of these novel models.To gauge their effectiveness in specialized tasks,we employ six challenging tasks in three different application scenarios:natural,healthcare,and industrial.These six tasks include salient/camouflaged/transparent object detection,as well as polyp detection,skin lesion detection,and industrial anomaly detection.We examine the performance of three recent open-source LVLMs,including MiniGPT-v2,LLaVA-1.5,and Shikra,on both visual recognition and localization in these tasks.Moreover,we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V,assessing their multi-modal understanding capabilities in general tasks including object counting,absurd question answering,affordance reasoning,attribute recognition,and spatial relation reasoning.Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks.We delve deep into this inadequacy and uncover several potential factors,including limited cognition in specialized tasks,object hallucination,text-to-image interference,and decreased robustness in complex problems.We hope that this study can provide useful insights for the future development of LVLMs,helping researchers improve LVLMs for both general and specialized applications.
基金supported in part by the National Natural Science Foundation of China,No.62101136Shanghai Sailing Program,No.21YF1402800National Institutes of Health,Nos.R01CA237267,R01HL151561,R01EB031102,and R01EB032716.
文摘Large language models(LLMs),such as ChatGPT,have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains.Recently,large vision-language models(VLMs)that learn rich vision–language correlation from image–text pairs,like BLIP-2 and GPT-4,have been intensively investigated.However,despite these developments,the application of LLMs and VLMs in image quality assessment(IQA),particularly in medical imaging,remains unexplored.This is valuable for objective performance evaluation and potential supplement or even replacement of radiologists’opinions.To this end,this study intro-duces IQAGPT,an innovative computed tomography(CT)IQA system that integrates image-quality captioning VLM with ChatGPT to generate quality scores and textual reports.First,a CT-IQA dataset comprising 1,000 CT slices with diverse quality levels is professionally annotated and compiled for training and evaluation.To better leverage the capabilities of LLMs,the annotated quality scores are converted into semantically rich text descriptions using a prompt template.Second,the image-quality captioning VLM is fine-tuned on the CT-IQA dataset to generate qual-ity descriptions.The captioning model fuses image and text features through cross-modal attention.Third,based on the quality descriptions,users verbally request ChatGPT to rate image-quality scores or produce radiological qual-ity reports.Results demonstrate the feasibility of assessing image quality using LLMs.The proposed IQAGPT outper-formed GPT-4 and CLIP-IQA,as well as multitask classification and regression models that solely rely on images.
基金supported by the National Natural Science Foundation of China (61702528,61806212)。
文摘In the field of satellite imagery, remote sensing image captioning(RSIC) is a hot topic with the challenge of overfitting and difficulty of image and text alignment. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC to jointly represent vision and language. First, a new RSIC dataset DIOR-Captions is built for augmenting object detection in optical remote(DIOR) sensing images dataset with manually annotated Chinese and English contents. Second, a Vision-Language aligning model with Cross-modal Attention(VLCA) is presented to generate accurate and abundant bilingual descriptions for remote sensing images. Third, a crossmodal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-toend Chinese captions generation by using the pre-training language model of Chinese. The experiments are carried out with various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm is more descriptive and informative than existing algorithms in producing captions.
文摘Model evaluation using benchmark datasets is an important method to measure the capability of large language models(LLMs)in specific domains,and it is mainly used to assess the knowledge and reasoning abilities of LLMs.Therefore,in order to better assess the capability of LLMs in the agricultural domain,Agri-Eval was proposed as a benchmark for assessing the knowledge and reasoning ability of LLMs in agriculture.The assessment dataset used in Agri-Eval covered seven major disciplines in the agricultural domain:crop science,horticulture,plant protection,animal husbandry,forest science,aquaculture science,and grass science,and contained a total of 2283 questions.Among domestic general-purpose LLMs,DeepSeek R1 performed best with an accuracy rate of 75.49%.In the realm of international general-purpose LLMs,Gemini 2.0 pro exp 0205 standed out as the top performer,achieving an accuracy rate of 74.28%.As an LLMs in agriculture vertical,Shennong V2.0 outperformed all the LLMs in China,and the answer accuracy rate of agricultural knowledge exceeded that of all the existing general-purpose LLMs.The launch of Agri-Eval helped the LLM developers to comprehensively evaluate the model's capability in the field of agriculture through a variety of tasks and tests to promote the development of the LLMs in the field of agriculture.
文摘We present a novel framework,CLIPSP,and a novel adaptive prompt method to leverage pre-trained knowledge from CLIP for scene parsing.Our approach addresses the limitations of DenseCLIP,which demonstrates the superior image segmentation provided by CLIP pre-trained models over ImageNet pre-trained models,but struggles with rough pixel-text score maps for complex scene parsing.We argue that,as they contain all textual information in a dataset,the pixel-text score maps,i.e.,dense prompts,are inevitably mixed with noise.To overcome this challenge,we propose a two-step method.Firstly,we extract visual and language features and perform multi-label classification to identify the most likely categories in the input images.Secondly,based on the top-k categories and confidence scores,our method generates scene tokens which can be treated as adaptive prompts for implicit modeling of scenes,and incorporates them into the visual features fed into the decoder for segmentation.Our method imposes a constraint on prompts and suppresses the probability of irrelevant categories appearing in the scene parsing results.Our method achieves competitive performance,limited by the available visual-language pre-trained models.Our CLIP-SP performs 1.14%better(in terms of mIoU)than DenseCLIP on ADE20K,using a ResNet-50 backbone.
基金Supported by the National Natural Science Foundation of China(12261018)Universities Key Laboratory of Mathematical Modeling and Data Mining in Guizhou Province(2023013)。
文摘In this paper,we establish and study a single-species logistic model with impulsive age-selective harvesting.First,we prove the ultimate boundedness of the solutions of the system.Then,we obtain conditions for the asymptotic stability of the trivial solution and the positive periodic solution.Finally,numerical simulations are presented to validate our results.Our results show that age-selective harvesting is more conducive to sustainable population survival than non-age-selective harvesting.
基金the World Climate Research Programme(WCRP),Climate Variability and Predictability(CLIVAR),and Global Energy and Water Exchanges(GEWEX)for facilitating the coordination of African monsoon researchsupport from the Center for Earth System Modeling,Analysis,and Data at the Pennsylvania State Universitythe support of the Office of Science of the U.S.Department of Energy Biological and Environmental Research as part of the Regional&Global Model Analysis(RGMA)program area。
文摘In recent years,there has been an increasing need for climate information across diverse sectors of society.This demand has arisen from the necessity to adapt to and mitigate the impacts of climate variability and change.Likewise,this period has seen a significant increase in our understanding of the physical processes and mechanisms that drive precipitation and its variability across different regions of Africa.By leveraging a large volume of climate model outputs,numerous studies have investigated the model representation of African precipitation as well as underlying physical processes.These studies have assessed whether the physical processes are well depicted and whether the models are fit for informing mitigation and adaptation strategies.This paper provides a review of the progress in precipitation simulation overAfrica in state-of-the-science climate models and discusses the major issues and challenges that remain.
文摘Utilizing finite element analysis,the ballistic protection provided by a combination of perforated D-shaped and base armor plates,collectively referred to as radiator armor,is evaluated.ANSYS Explicit Dynamics is employed to simulate the ballistic impact of 7.62 mm armor-piercing projectiles on Aluminum AA5083-H116 and Steel Secure 500 armors,focusing on the evaluation of material deformation and penetration resistance at varying impact points.While the D-shaped armor plate is penetrated by the armor-piercing projectiles,the combination of the perforated D-shaped and base armor plates successfully halts penetration.A numerical model based on the finite element method is developed using software such as SolidWorks and ANSYS to analyze the interaction between radiator armor and bullet.The perforated design of radiator armor is to maintain airflow for radiator function,with hole sizes smaller than the bullet core diameter to protect radiator assemblies.Predictions are made regarding the brittle fracture resulting from the projectile core′s bending due to asymmetric impact,and the resulting fragments failed to penetrate the perforated base armor plate.Craters are formed on the surface of the perforated D-shaped armor plate due to the impact of projectile fragments.The numerical model accurately predicts hole growth and projectile penetration upon impact with the armor,demonstrating effective protection of the radiator assemblies by the radiator armor.
基金supported by the CAS Pioneer Hundred Talents Program and Second Tibetan Plateau Scientific Expedition Research Program(2019QZKK0708)as well as the Basic Research Program of Qinghai Province:Lithospheric Geomagnetic Field of the Qinghai‒Tibet Plateau and the Relationship with Strong Earthquakes(2021-ZJ-969Q).
文摘The National Geophysical Data Center(NGDC)of the United States has collected aeromagnetic data for input into a series of geomagnetic models to improve model resolution;however,in the Tibetan Plateau region,ground-based observations remain insufficient to clearly reflect the characteristics of the region’s lithospheric magnetism.In this study,we evaluate the lithospheric magnetism of the Tibetan Plateau by using a 3D surface spline model based on observations from>200 newly constructed repeat stations(portable stations)to determine the spatial distribution of plateau geomagnetism,as well as its correlation with the tectonic features of the region.We analyze the relationships between M≥5 earthquakes and lithospheric magnetic field variations on the Tibetan Plateau and identify regions susceptible to strong earthquakes.We compare the geomagnetic results with those from an enhanced magnetic model(EMM2015)developed by the NGDC and provide insights into improving lithospheric magnetic field calculations in the Tibetan Plateau region.Further research reveals that these magnetic anomalies exhibit distinct differences from the magnetic-seismic correlation mechanisms observed in other tectonic settings;here,they are governed primarily by the combined effects of compressional magnetism,thermal magnetism,and deep thermal stress.This study provides new evidence of geomagnetic anomalies on the Tibetan Plateau,interprets them physically,and demonstrates their potential for identifying seismic hazard zones on the Plateau.
文摘Climate model prediction has been improved by enhancing model resolution as well as the implementation of sophisticated physical parameterization and refinement of data assimilation systems[section 6.1 in Wang et al.(2025)].In relation to seasonal forecasting and climate projection in the East Asian summer monsoon season,proper simulation of the seasonal migration of rain bands by models is a challenging and limiting factor[section 7.1 in Wang et al.(2025)].
基金supported by Science and Technology Research Project of Jiangxi Education Department.Project Grant No.GJJ2203306.
文摘Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modes,such as text and image,to accurately assess sentiment.However,conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities.This limitation is attributed to their training on unimodal data,and necessitates the use of complex fusion mechanisms for sentiment analysis.In this study,we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method.Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework.We employ a Transformer architecture to integrate these representations,thereby enabling the capture of rich semantic infor-mation in image-text pairs.To further enhance the representation learning of these pairs,we introduce our proposed multimodal contrastive learning method,which leads to improved performance in sentiment analysis tasks.Our approach is evaluated through extensive experiments on two publicly accessible datasets,where we demonstrate its effectiveness.We achieve a significant improvement in sentiment analysis accuracy,indicating the supe-riority of our approach over existing techniques.These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
基金Funded by State Railway Administration Research Project(No.2023JS007)National Natural Science Foundation of China(No.52438002)+1 种基金Research and Development Programs for Science and Technology of China Railways Corporation(No.J2023G003)New Cornerstone Science Foundation through the XPLORER PRIZE。
文摘To investigate the influence of coarse aggregate parent rock properties on the elastic modulus of concrete,the mineralogical properties and stress-strain curves of granite and dolomite parent rocks,as well as the strength and elastic modulus of mortar and concrete prepared with mechanism aggregates of the corresponding lithology,and the stress-strain curves of concrete were investigated.In this paper,a coarse aggregate and mortar matrix bonding assumption is proposed,and a prediction model for the elastic modulus of mortar is established by considering the lithology of the mechanism sand and the slurry components.An equivalent coarse aggregate elastic modulus model was established by considering factors such as coarse aggregate particle size,volume fraction,and mortar thickness between coarse aggregates.Based on the elastic modulus of the equivalent coarse aggregate and the remaining mortar,a prediction model for the elastic modulus of the two and three components of concrete in series and then in parallel was established,and the predicted values differed from the measured values within 10%.It is proposed that the coarse aggregate elastic modulus in highstrength concrete is the most critical factor affecting the elastic modulus of concrete,and as the coarse aggregate elastic modulus increases by 27.7%,the concrete elastic modulus increases by 19.5%.
基金supported by the National Natural Science Foundation of China(Grant No.12272018)the National Key Basic Research Project(2022JCJQZD20600).
文摘Kinetic impact is the most practical planetary-defense technique,with momentum-transfer efficiency central to deflection design.We present a Monte Carlo photometric framework that couples ejecta sampling,dynamical evolution,and image synthesis to compare directly with HST,LICIACube,ground-based and Lucy observations of the DART impact.Decomposing ejecta into(1)a highvelocity(~1600 m/s)plume exhibiting Na/K resonance,(2)a low-velocity(~1 m/s)conical component shaped by binary gravity and solar radiation pressure,and(3)meter-scale boulders,we quantify each component’s mass and momentum.Fitting photometric decay curves and morphological evolution yields size-velocity distributions and,via scaling laws,estimates of Dimorphos’bulk density,cratering parameters,and cohesive strength that agree with dynamical constraints.Photometric ejecta modeling therefore provides a robust route to constrain momentum enhancement and target properties,improving predictive capability for kinetic-deflection missions.
文摘Customer churn is the rate at which customers discontinue doing business with a company over a given time period.It is an essential measure for businesses to monitor high churn rates,as they often indicate underlying issues with services,products,or customer experience,resulting in considerable income loss.Prediction of customer churn is a crucial task aimed at retaining customers and maintaining revenue growth.Traditional machine learning(ML)models often struggle to capture complex temporal dependencies in client behavior data.To address this,an optimized deep learning(DL)approach using a Regularized Bidirectional Long Short-Term Memory(RBiLSTM)model is proposed to mitigate overfitting and improve generalization error.The model integrates dropout,L2-regularization,and early stopping to enhance predictive accuracy while preventing over-reliance on specific patterns.Moreover,this study investigates the effect of optimization techniques on boosting the training efficiency of the developed model.Experimental results on a recent public customer churn dataset demonstrate that the trained model outperforms the traditional ML models and some other DL models,such as Long Short-Term Memory(LSTM)and Deep Neural Network(DNN),in churn prediction performance and stability.The proposed approach achieves 96.1%accuracy,compared with LSTM and DNN,which attain 94.5%and 94.1%accuracy,respectively.These results confirm that the proposed approach can be used as a valuable tool for businesses to identify at-risk consumers proactively and implement targeted retention strategies.
基金funded by the Office of the Vice-President for Research and Development of Cebu Technological University.
文摘This study demonstrates a novel integration of large language models,machine learning,and multicriteria decision-making to investigate self-moderation in small online communities,a topic under-explored compared to user behavior and platform-driven moderation on social media.The proposed methodological framework(1)utilizes large language models for social media post analysis and categorization,(2)employs k-means clustering for content characterization,and(3)incorporates the TODIM(Tomada de Decisão Interativa Multicritério)method to determine moderation strategies based on expert judgments.In general,the fully integrated framework leverages the strengths of these intelligent systems in a more systematic evaluation of large-scale decision problems.When applied in social media moderation,this approach promotes nuanced and context-sensitive self-moderation by taking into account factors such as cultural background and geographic location.The application of this framework is demonstrated within Facebook groups.Eight distinct content clusters encompassing safety,harassment,diversity,and misinformation are identified.Analysis revealed a preference for content removal across all clusters,suggesting a cautious approach towards potentially harmful content.However,the framework also highlights the use of other moderation actions,like account suspension,depending on the content category.These findings contribute to the growing body of research on self-moderation and offer valuable insights for creating safer and more inclusive online spaces within smaller communities.