Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical importance of this task, conventional vision-based foreign object detection methodologies have predominantly concentrated on image data, neglecting the exploration and integration of textual information. The currently popular multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to enable simultaneous understanding of both visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. This model leverages CLIP’s robust generalization capabilities to enhance performance in the context of catenary foreign object detection. The Railway-CLIP model is primarily composed of an image encoder and a text encoder. Initially, the Segment Anything Model (SAM) is employed to preprocess raw images, identifying candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are subsequently fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for both the original images and the candidate bounding boxes to serve as textual inputs. These prompts are then processed by the text encoder to derive textual features. The image and text encoders collaboratively project the multimodal features into a shared semantic space, facilitating the computation of similarity scores between visual and textual representations. The final detection results are determined based on these similarity scores, ensuring a robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and 92.66% F1-score, thereby validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.
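To make the scoring step concrete, the following is a minimal sketch of CLIP-style image-text similarity over one SAM-proposed candidate crop, using the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers. It is not the authors' Railway-CLIP model: the file name, prompt wording, and decision threshold are illustrative placeholders.

```python
# Minimal sketch of the CLIP similarity-scoring step (illustrative only;
# not Railway-CLIP's trained weights or prompt templates).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

crop = Image.open("candidate_box.png")  # hypothetical SAM-proposed region
prompts = [
    "a photo of a normal catenary component",          # placeholder prompt
    "a photo of a foreign object on a catenary wire",  # placeholder prompt
]

inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity
probs = logits.softmax(dim=-1)
is_anomalous = probs[0, 1].item() > 0.5  # decision threshold is a placeholder
```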
Since the 1950s, when the Turing Test was introduced, there has been notable progress in machine language intelligence. Language modeling, crucial for AI development, has evolved from statistical to neural models over the last two decades. Recently, transformer-based Pre-trained Language Models (PLMs) have excelled in Natural Language Processing (NLP) tasks by leveraging large-scale training corpora. Increasing the scale of these models enhances performance significantly, introducing abilities like in-context learning that smaller models lack. The advancement in Large Language Models, exemplified by the development of ChatGPT, has made significant impacts both academically and industrially, capturing widespread societal interest. This survey provides an overview of the development and prospects from Large Language Models (LLMs) to Large Multimodal Models (LMMs). It first discusses the contributions and technological advancements of LLMs in the field of natural language processing, especially in text generation and language understanding. Then, it turns to the discussion of LMMs, which integrate various data modalities such as text, images, and sound, demonstrating advanced capabilities in understanding and generating cross-modal content and paving new pathways for the adaptability and flexibility of AI systems. Finally, the survey highlights the prospects of LMMs in terms of technological development and application potential, while also pointing out challenges in data integration and cross-modal understanding accuracy, providing a comprehensive perspective on the latest developments in this field.
Students are considered one of the groups most affected by psychological problems. Given the highly dangerous nature of mental illnesses and the increasingly serious state of global mental health, it is imperative for us to explore new methods and approaches concerning the prevention and treatment of mental illnesses. Large multimodal models (LMMs), as the most advanced artificial intelligence models (i.e., ChatGPT-4), have brought new hope to the accurate prevention, diagnosis, and treatment of psychiatric disorders. The assistance of these models in the promotion of mental health is critical, as the latter necessitates a strong foundation of medical knowledge and professional skills, emotional support, stigma mitigation, the encouragement of more honest patient self-disclosure, reduced health care costs, improved medical efficiency, and greater mental health service coverage. However, these models must simultaneously address challenges related to health, safety, hallucinations, and ethics. In the future, we should address these challenges by developing relevant usage manuals, accountability rules, and legal regulations; implementing a human-centered approach; and intelligently upgrading LMMs through the deep optimization of such models, their algorithms, and other means. This effort will thus substantially contribute not only to the maintenance of students’ health but also to the achievement of global sustainable development goals.
Active object detection (AOD) is a crucial task in the field of robotics. A key challenge in household environments for AOD is that the target object is often undetectable due to partial occlusion, which leads to the failure of traditional methods. To address the occlusion problem, this paper first proposes a novel occlusion handling method based on the large multimodal model (LMM). The method utilises an LMM to detect and analyse input RGB images and generates adjustment actions to progressively eliminate occlusion. After the occlusion is handled, an improved AOD method based on a deep Q-learning network (DQN) is used to complete the task. We introduce an attention mechanism to process image features, enabling the model to focus on critical regions of the input images. Additionally, a new reward function is proposed that comprehensively considers the bounding box of the target object and the robot's distance to the object, along with the actions performed by the robot. Experiments on the dataset and in real-world scenarios validate the effectiveness of the proposed method in performing AOD tasks under partial occlusion.
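As a rough illustration of the reward design described above, the sketch below combines a detection term, a distance term, and an action-cost term. The weights and exact functional form are assumptions for exposition, not the formula from the paper.

```python
# Hedged sketch of an AOD-style reward: rewards detection confidence,
# penalizes distance to the target and the number of executed actions.
# Weights w_det, w_dist, w_act are illustrative assumptions.
def aod_reward(bbox_score: float, distance_m: float, n_actions: int,
               w_det: float = 1.0, w_dist: float = 0.5, w_act: float = 0.1) -> float:
    """bbox_score: detector confidence for the target's bounding box (0..1);
    distance_m: robot-to-object distance in meters; n_actions: steps taken."""
    return w_det * bbox_score - w_dist * distance_m - w_act * n_actions
```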
Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges, including layout complexity in unstructured formats, limitations in recognizing visual elements, the correlation between different parts of a document, and domain-specific semantics. Simply extracting text is not sufficient; advanced reasoning capabilities are proving essential to analyze content and answer questions accurately. This paper evaluates the ability of Large Language Models (LLMs) to correctly answer questions about various types of charts, comparing their performance when using images as input versus directly parsing PDF files. To retrieve the images from the PDF, ColPali, a model leveraging state-of-the-art visual language models, is used to identify the relevant page containing the appropriate chart for each question. Google’s Gemini multimodal models were used to answer a set of questions through two approaches: 1) processing images derived from PDF documents and 2) directly utilizing the content of the same PDFs. Our findings underscore the limitations of traditional OCR-based approaches in visually rich document understanding (VrDU) and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks. Through structured benchmarking of chart question answering (CQA) across input formats, our work contributes to the advancement of chart understanding (CU) and the broader field of multimodal document analysis. Using two diverse and information-rich sources, the World Health Statistics 2024 report by the World Health Organization and the Global Banking Annual Review 2024 by McKinsey & Company, we examine the performance of multimodal LLMs across different input modalities, comparing their effectiveness in processing charts as images versus parsing them directly from PDF content. These documents were selected for their multimodal nature, combining dense textual analysis with varied visual representations and thus presenting realistic challenges for vision-language models. This comparison assesses how advanced models perform with different input formats and determines whether an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.
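The two input routes being compared can be sketched with the google-generativeai Python SDK as below; the model name, file paths, and question are placeholders, and this is not the authors' exact benchmarking harness.

```python
# Sketch of the two evaluation routes: chart-as-image vs. PDF-as-context.
# Model name, file paths, and the question are illustrative placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")          # hypothetical key
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model choice

question = "Which region shows the highest value in this chart?"

# Route 1: image input -- the relevant page, located beforehand (e.g., via ColPali).
img_reply = model.generate_content([question, Image.open("chart_page.png")])

# Route 2: direct PDF input via the File API.
pdf_file = genai.upload_file("report.pdf")
pdf_reply = model.generate_content([question, pdf_file])

print(img_reply.text)
print(pdf_reply.text)
```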
Generative Artificial Intelligence (GAI) refers to a class of AI systems capable of creating novel, coherent, and contextually relevant content—such as text, images, audio, and video—based on patterns learned from extensive training datasets. The public release and rapid refinement of large language models (LLMs) like ChatGPT have accelerated the adoption of GAI across various medical specialties, offering new tools for education, clinical simulation, and research. Dermatology training, which relies heavily on visual pattern recognition and requires extensive exposure to diverse morphological presentations, faces persistent challenges such as the uneven distribution of educational resources, limited patient exposure for rare conditions, and variability in teaching quality. Exploring the integration of GAI into pedagogical frameworks offers innovative approaches to address these challenges, potentially enhancing the quality, standardization, scalability, and accessibility of dermatology education. This comprehensive review examines the core concepts and technical foundations of GAI, highlights its specific applications within dermatology teaching and learning—including simulated case generation, personalized learning pathways, and academic support—and discusses the current limitations, practical challenges, and ethical considerations surrounding its use. The aim is to provide a balanced perspective on the significant potential of GAI for transforming dermatology education and to offer evidence-based insights to guide future exploration, implementation, and policy development.
Sarcasm detection in Natural Language Processing (NLP) has become increasingly important, particularly with the rise of social media and non-textual emotional expressions, such as images. Existing methods often rely on separate image and text modalities, which may not fully utilize the information available from both sources. To address this limitation, we propose a novel multimodal large model, i.e., the PKME-MLM (Prior Knowledge and Multi-label Emotion analysis based Multimodal Large Model for sarcasm detection). The PKME-MLM aims to enhance sarcasm detection by integrating prior knowledge to extract useful textual information from images, which is then combined with text data for deeper analysis. This method improves the integration of image and text data, addressing the limitation of previous models that process these modalities separately. Additionally, we incorporate multi-label sentiment analysis, refining sentiment labels to improve sarcasm recognition accuracy. This design overcomes the limitations of prior models that treated sentiment classification as a single-label problem, thereby improving sarcasm recognition by distinguishing subtle emotional cues from the text. Experimental results demonstrate that our approach achieves significant performance improvements in multimodal sarcasm detection tasks, with an accuracy (Acc.) of 94.35%, and Macro-Average Precision and Recall reaching 93.92% and 94.21%, respectively. These results highlight the potential of multimodal models in improving sarcasm detection and suggest that further integration of modalities could advance future research. This work also paves the way for incorporating multimodal sentiment analysis into sarcasm detection.
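The single-label versus multi-label distinction mentioned above can be illustrated in a few lines of PyTorch; the feature dimension and emotion-set size below are arbitrary assumptions, and this is not the PKME-MLM architecture itself.

```python
# Contrast between a softmax single-label head and a sigmoid multi-label head
# over fused image+text features; sizes are illustrative assumptions.
import torch
import torch.nn as nn

fused = torch.randn(8, 768)            # batch of 8 fused multimodal features
n_emotions = 6                         # assumed emotion label set size

head = nn.Linear(768, n_emotions)
logits = head(fused)

single_label = logits.argmax(dim=-1)        # softmax view: exactly one emotion
multi_label = torch.sigmoid(logits) > 0.5   # multi-label view: emotions co-occur,
                                            # preserving the mixed cues that
                                            # sarcasm often rides on
```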
BACKGROUND: Gastrointestinal diseases have complex etiologies and clinical presentations. An accurate diagnosis requires physicians to integrate diverse information, including medical history, laboratory test results, and imaging findings. Existing artificial intelligence-assisted diagnostic tools are limited to single-modality information, resulting in recommendations that are often incomplete and may be associated with clinical or legal risks. AIM: To develop and evaluate a collaborative multimodal large language model (LLM) framework for clinical decision-making in digestive diseases. METHODS: In this observational study, DeepGut, a multimodal LLM collaborative diagnostic framework, was developed to integrate four distinct large models into a four-tiered structure. The framework sequentially accomplishes multimodal information extraction, logical “chain” construction, diagnostic and treatment suggestion generation, and risk analysis. The model was evaluated using objective metrics, which assess the reliability and comprehensiveness of model-generated results, and subjective expert opinions, which examine the effectiveness of the framework in assisting physicians. RESULTS: The diagnostic and treatment recommendations generated by the DeepGut framework achieved exceptional performance, with a diagnostic accuracy of 97.8%, diagnostic completeness of 93.9%, treatment plan accuracy of 95.2%, and treatment plan completeness of 98.0%, significantly surpassing the capabilities of single-modal LLM-based diagnostic tools. Experts evaluating the framework commended the completeness, relevance, and logical coherence of its outputs. However, the collaborative multimodal LLM approach resulted in increased input and output token counts, leading to higher computational costs and extended diagnostic times. CONCLUSION: The framework achieves successful integration of multimodal diagnostic data, demonstrating enhanced performance enabled by multimodal LLM collaboration, which opens new horizons for the clinical application of artificial intelligence-assisted technology.
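The four-tier collaboration reads naturally as a sequential chain of model calls. The sketch below is purely conceptual: call_llm is a hypothetical dispatcher, and the tier instructions are placeholders rather than DeepGut's actual prompts or constituent models.

```python
# Conceptual four-tier chain in the spirit of the framework described.
# `call_llm` is a hypothetical helper that routes a request to the large
# model assigned to that tier; instructions are placeholders.
def call_llm(tier_instruction: str, payload: str) -> str:
    raise NotImplementedError("route to the model assigned to this tier")

def four_tier_pipeline(case_record: str) -> dict:
    findings = call_llm("extract multimodal findings", case_record)
    chain = call_llm("construct a diagnostic logic chain", findings)
    plan = call_llm("generate diagnosis and treatment suggestions", chain)
    risks = call_llm("analyse risks of the proposed plan", plan)
    return {"findings": findings, "chain": chain, "plan": plan, "risks": risks}
```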
The application of visual-language large models in the field of medical health has gradually become a research focus. These models combine image understanding with natural language processing and can simultaneously process multimodal data such as medical images and medical reports. They can not only recognize images but also understand the semantic relationships between images and texts, effectively realizing the integration of medical information and providing strong support for clinical decision-making and disease diagnosis. Visual-language large models perform well on specific medical tasks and also show strong potential and high intelligence as general task models. This paper provides a comprehensive review of visual-language large models in the field of medical health. Specifically, it first introduces the basic theoretical foundations and technical principles. It then introduces specific application scenarios in the field of medical health, including modality fusion, semi-supervised learning, weakly supervised learning, unsupervised learning, cross-domain models, and general models. Finally, challenges including insufficient data, interpretability, and practical deployment are discussed. In light of these existing challenges, four potential future development directions are given.
User identity linkage (UIL) refers to identifying user accounts belonging to the same identity across different social media platforms. Most current research is based on text analysis, which fails to fully exploit the rich image resources generated by users; existing attempts that touch on the multimodal domain still face the challenge of semantic differences between text and images. Given this, we investigate the UIL task across different social media platforms based on multimodal user-generated content (UGC). We introduce the efficient user identity linkage via aligned multimodal features and temporal correlation (EUIL) approach. The method first generates captions for user-posted images with the BLIP model, alleviating the problem of missing textual information. Subsequently, we extract aligned text and image features with the CLIP model, which closely aligns the two modalities and significantly reduces the semantic gap. Accordingly, we construct a set of adapter modules to integrate the multimodal features. Furthermore, we design a temporal weight assignment mechanism to incorporate the temporal dimension of user behavior. We evaluate the proposed scheme on the real-world social dataset TWIN, and the results show that our method reaches 86.39% accuracy, which demonstrates its excellence in handling multimodal data and provides strong algorithmic support for UIL.
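The first two steps — BLIP captioning followed by aligned CLIP feature extraction — can be sketched with public Hugging Face checkpoints as below. The adapter modules and temporal weighting from the paper are omitted, and the image path is a placeholder.

```python
# Sketch of BLIP captioning plus aligned CLIP features using public
# checkpoints; not the EUIL pipeline itself (adapters and temporal
# weighting are omitted).
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPModel, CLIPProcessor)

image = Image.open("user_post.jpg")  # hypothetical user-posted image

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
caption_ids = blip.generate(**blip_proc(image, return_tensors="pt"))
caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    img_feat = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    txt_feat = clip.get_text_features(**clip_proc(text=[caption], return_tensors="pt",
                                                  padding=True))
# img_feat and txt_feat now live in CLIP's shared space, narrowing the
# text-image semantic gap before any downstream fusion.
```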
This article elucidates the concept of large model technology, summarizes the research status of large model technology both domestically and internationally, provides an overview of the application status of large models in vertical industries, outlines the challenges and issues confronted in applying large models in the oil and gas sector, and offers prospects for the application of large models in the oil and gas industry. Existing large models can be briefly divided into three categories: large language models, visual large models, and multimodal large models. The application of large models in the oil and gas industry is still in its infancy. Based on open-source large language models, some oil and gas enterprises have released large language model products using methods like fine-tuning and retrieval-augmented generation. Scholars have attempted to develop scenario-specific models for oil and gas operations by using visual/multimodal foundation models. A few researchers have constructed pre-trained foundation models for seismic data processing and interpretation, as well as core analysis. The application of large models in the oil and gas industry faces challenges such as data quantity and quality that are currently insufficient to support large model training, high research and development costs, and limited algorithmic autonomy and controllability. The application of large models should be guided by the needs of the oil and gas business, taking the adoption of large models as an opportunity to improve data lifecycle management, enhance data governance capabilities, promote the construction of computing power, strengthen the building of “artificial intelligence + energy” composite teams, and boost the autonomy and controllability of large model technology.
The additive manufacturing (AM) landscape has significantly transformed in alignment with Industry 4.0 principles, primarily driven by the integration of artificial intelligence (AI) and digital twins (DT). However, current intelligent AM (IAM) systems face limitations such as fragmented AI tool usage and suboptimal human-machine interaction. This paper reviews existing IAM solutions, emphasizing control, monitoring, process autonomy, and end-to-end integration, and identifies key limitations, such as the absence of a high-level controller for global decision-making. To address these gaps, we propose a transition from IAM to autonomous AM, featuring a hierarchical framework with four integrated layers: knowledge, generative solution, operational, and cognitive. In the cognitive layer, AI agents notably enable machines to independently observe, analyze, plan, and execute operations that traditionally require human intervention. These capabilities streamline production processes and expand the possibilities for innovation, particularly in sectors like in-space manufacturing. Additionally, this paper discusses the role of AI in self-optimization and lifelong learning, positing that the future of AM will be characterized by a symbiotic relationship between human expertise and advanced autonomy, fostering a more adaptive, resilient manufacturing ecosystem.
The task of recognizing Chinese variant characters aims to address the challenges of semantic ambiguity and confusion, which potentially pose risks to the security of Web content and complicate the governance of sensitive words. Most existing approaches predominantly prioritize the acquisition of contextual knowledge from Chinese corpora and vocabularies during pretraining, often overlooking the inherent phonological and morphological characteristics of the Chinese language. To address these issues, we propose a shared-weight multimodal translation model (SMTM) based on the multimodal information of Chinese characters, which integrates the phonology of Pinyin and the morphology of fonts into each Chinese character token to learn the deeper semantics of variant text. Specifically, we encode the Pinyin features of Chinese characters using an embedding layer, while the font features of Chinese characters are extracted directly with convolutional neural networks. Considering the multimodal similarity between the source and target sentences of the Chinese variant-character-recognition task, we design a shared-weight embedding mechanism to generate target sentences using heuristic information from the source sentences during training. The simulation results show that our proposed SMTM achieves remarkable performance of 89.550% and 79.480% on the bilingual evaluation understudy (BLEU) and F1 metrics, respectively, with significant improvement over state-of-the-art baseline models.
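A toy version of the two character-level encoders — an embedding layer over Pinyin tokens and a small CNN over rendered glyph bitmaps — is sketched below. The vocabulary size, dimensions, glyph resolution, and additive fusion are assumptions; the shared-weight translation mechanism itself is not reproduced here.

```python
# Toy sketch of the two modality encoders described: Pinyin embeddings plus
# a CNN over 32x32 glyph bitmaps; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

PINYIN_VOCAB, EMB_DIM = 1500, 128            # assumed sizes

pinyin_embed = nn.Embedding(PINYIN_VOCAB, EMB_DIM)

font_cnn = nn.Sequential(                    # input: 1x32x32 glyph bitmap
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 8 * 8, EMB_DIM),
)

pinyin_ids = torch.tensor([[42, 7, 133]])    # one sentence, three characters
glyphs = torch.randn(3, 1, 32, 32)           # stand-in for rendered glyphs

# Fuse the two views per character (additive fusion is our assumption).
char_repr = pinyin_embed(pinyin_ids)[0] + font_cnn(glyphs)  # shape (3, 128)
```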
Can current robotic technologies truly replicate the full scope and intricacies of human labour? In practice, the adoption of robots remains limited, especially in open, unstructured environments commonly encountered in everyday scenarios such as services, healthcare, agriculture, construction, and numerous other fields. From the perspective of general robotic manipulation, the challenges arise from three factors. (1) High operational barriers: human operators are obliged to master specialized robotic programming languages and gain a deep understanding of the tasks at hand. These tasks need to be broken down into action-level robotic programs, which results in high labour costs. (2) Limited autonomous task execution: robots lack the capability to independently plan and execute the actions required to achieve the target tasks. This limitation renders them unsuitable for deployment in open, unstructured environments that demand sophisticated interaction and seamless collaboration with humans.
The rapid advancements in large language models (LLMs) have led to an exponential increase in survey papers, making it challenging to systematically track and analyze their evolving taxonomy. This study employs graph representation learning combined with classical machine learning techniques to model and interpret the structural evolution of LLM-related survey papers. By constructing attributed graphs that capture topic distributions and interconnections, we provide a data-driven framework to explore research trends in this domain. A dataset of 241 survey papers published between July 2021 and January 2024 is analyzed to identify thematic developments and interdisciplinary relationships. The results highlight key areas of specialization, including the emergence of prompting science, multimodal models, and domain-specific applications in finance, education, and law. Co-occurrence analysis of survey topics reveals strong interconnections between core LLM research and fields such as software engineering, hardware architecture, and evaluation methodologies. These findings demonstrate the increasing specialization of LLM research and its growing integration across multiple disciplines. By leveraging graph-based methodologies, this study offers a structured approach to understanding the LLM survey landscape, facilitating efficient navigation of existing literature and identification of emerging research directions. The insights presented contribute to a more comprehensive understanding of the field’s trajectory, assisting researchers and practitioners in engaging with the latest developments in LLM research.
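A minimal version of the topic co-occurrence graph can be built with networkx as below; the paper records and topic labels are invented examples, not the study's 241-paper dataset.

```python
# Minimal sketch of an attributed topic co-occurrence graph; the records
# below are invented examples for illustration only.
import itertools
import networkx as nx

papers = [
    {"id": "s1", "topics": ["prompting", "evaluation"]},
    {"id": "s2", "topics": ["multimodal", "evaluation"]},
    {"id": "s3", "topics": ["prompting", "multimodal", "finance"]},
]

G = nx.Graph()
for p in papers:
    for a, b in itertools.combinations(sorted(p["topics"]), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)   # edge weight = co-occurrence count

# Strongest topic pairings first.
print(sorted(G.edges(data=True), key=lambda e: -e[2]["weight"]))
```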
The rapid advancement of artificial intelligence has significantly impacted education, with large-scale foundation models (LFMs) emerging as transformative tools. While LFMs have demonstrated exceptional performance across diverse domains, their integration into K-12 education remains in its early stages, requiring alignment with pedagogical principles, cognitive development, and curriculum standards. This paper provides a comprehensive technological review of LFM applications in K-12 education, examining current workflows, challenges, and future opportunities. We explore how LFMs facilitate personalized learning, teacher-student collaboration, and automated assessment while highlighting critical issues such as motivation, engagement, and age-appropriate instructional strategies. By analyzing global developments, this study offers valuable insights for educators seeking to optimize AI-driven teaching methods and for students leveraging AI for self-directed learning. Our findings aim to inform future research and drive innovation in educational AI, ensuring the effective and ethical integration of LFMs into the evolving K-12 educational landscape.
Video-text retrieval (VTR) is an essential task in multimodal learning, aiming to bridge the semantic gap between visual and textual data. Effective video frame sampling plays a crucial role in improving retrieval performance, as it determines the quality of the visual content representation. Traditional sampling methods, such as uniform sampling and optical flow-based techniques, often fail to capture the full semantic range of videos, leading to redundancy and inefficiencies. In this work, we propose CLIP4Video-Sampling, a global semantics-guided multi-granularity frame sampling strategy for video-text retrieval, designed to optimize both computational efficiency and retrieval accuracy. By integrating multi-scale global and local temporal sampling and leveraging the powerful feature extraction capabilities of the CLIP (Contrastive Language-Image Pre-training) model, our method significantly outperforms existing approaches in both zero-shot and fine-tuned video-text retrieval tasks on popular datasets. CLIP4Video-Sampling reduces redundancy, ensures keyframe coverage, and serves as an adaptable pre-processing module for multimodal models.
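In the spirit of the strategy described — a coarse global pass refined by feature-level redundancy pruning — the sketch below drops near-duplicate frames by CLIP-embedding similarity. The anchor count and similarity threshold are assumptions, not the method's actual settings.

```python
# Toy global-then-local frame sampler: uniform anchors, then pruning of
# frames whose CLIP embedding is too similar to the last kept frame.
# n_global and sim_thresh are illustrative assumptions.
import numpy as np

def sample_frames(features: np.ndarray, n_global: int = 8,
                  sim_thresh: float = 0.9) -> list[int]:
    """features: (num_frames, dim) L2-normalized CLIP frame embeddings."""
    num_frames = len(features)
    anchors = np.linspace(0, num_frames - 1, n_global, dtype=int)  # global pass
    kept = [int(anchors[0])]
    for idx in anchors[1:]:                                  # local pruning:
        if features[idx] @ features[kept[-1]] < sim_thresh:  # skip near-duplicates
            kept.append(int(idx))
    return kept

feats = np.random.randn(120, 512)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(sample_frames(feats))
```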
Traditional quantitative investment research is encountering diminishing returns alongside rising labor and time costs. To overcome these challenges, we introduce the large investment model (LIM), a novel research paradigm designed to enhance both performance and efficiency at scale. LIM employs end-to-end learning and universal modeling to create an upstream foundation model, which is capable of autonomously learning comprehensive signal patterns from diverse financial data spanning multiple exchanges, instruments, and frequencies. These “global patterns” are subsequently transferred to downstream strategy modeling, optimizing performance for specific tasks. We detail the system architecture design of LIM, address the technical challenges inherent in this approach, and outline potential directions for future research.
To combat multipath fading of electromagnetic waves in wireless communication within confined areas, a cooperative communication system for rectangular tunnels was established based on the multimode channel model, and the channel capacity formula was derived. Under the criterion of maximizing channel capacity, power allocation methods for both amplify-and-forward (AF) and decode-and-forward (DF) cooperative communication systems were proposed subject to a total power constraint. Mode selection methods for single-input single-output (SISO) and single-input multiple-output (SIMO) models in the rectangular tunnel, through which higher channel capacity can be obtained, were put forward as well. Theoretical analysis and simulation comparison show that the channel capacity of a wireless communication system in a rectangular tunnel can be effectively enhanced through cooperative technology; the channel capacity of the rectangular tunnel under complicated conditions is maximized through the proposed power allocation methods, and the cooperative mode with the optimal channel capacity can be chosen according to the mode selection methods given in the paper.
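For orientation, the textbook single-relay amplify-and-forward capacity takes the form below, where γ_sd, γ_sr, and γ_rd are the source-destination, source-relay, and relay-destination SNRs and the factor 1/2 reflects the two-slot relaying protocol. This is a standard reference expression, not the paper's tunnel-specific multimode derivation, which builds on this kind of formula.

```latex
% Textbook AF relay capacity over bandwidth B, shown for reference only;
% the tunnel-specific multimode derivation in the paper is more involved.
C_{\mathrm{AF}} = \frac{B}{2}\,\log_2\!\left(1 + \gamma_{sd}
  + \frac{\gamma_{sr}\,\gamma_{rd}}{\gamma_{sr} + \gamma_{rd} + 1}\right)
```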
A novel approach named aligned mixture probabilistic principal component analysis (AMPPCA) is proposed in this study for fault detection in multimode chemical processes. In order to exploit within-mode correlations, the AMPPCA algorithm first estimates a statistical description for each operating mode by applying mixture probabilistic principal component analysis (MPPCA). As a comparison, the combined MPPCA is employed, where monitoring results are softly integrated according to the posterior probabilities of the test sample in each local model. To exploit the cross-mode correlations, which may be useful but are inadvertently neglected by separately built monitoring models, a global monitoring model is constructed by aligning all local models together. In this way, both within-mode and cross-mode correlations are preserved in the integrated space. Finally, the utility and feasibility of AMPPCA are demonstrated through a non-isothermal continuous stirred tank reactor and the TE benchmark process.
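The soft integration used by the combined-MPPCA baseline can be written generically as a posterior-weighted sum of local monitoring statistics; the notation below is ours, not necessarily the paper's.

```latex
% Posterior-weighted combination of local monitoring statistics:
% \pi_k is the prior weight of mode k and p(x|k) its local density.
T^2(\mathbf{x}) = \sum_{k=1}^{K} P(k \mid \mathbf{x})\, T_k^2(\mathbf{x}),
\qquad
P(k \mid \mathbf{x}) = \frac{\pi_k\, p(\mathbf{x} \mid k)}
                            {\sum_{j=1}^{K} \pi_j\, p(\mathbf{x} \mid j)}
```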
基金supported by the Technology Research and Development Program of China National Railway Group(Q2024T002)the Open Project Fund of National Engineering Research Center of Digital Construction and Evaluation Technology of Urban Rail Transit(2024023).
文摘Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical importance of this task, conventional vision-based foreign object detection methodologies have predominantly concentrated on image data, neglecting the exploration and integration of textual information. The currently popular multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to enable simultaneous understanding of both visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. This model leverages CLIP’s robust generalization capabilities to enhance performance in the context of catenary foreign object detection. The Railway-CLIP model is primarily composed of an image encoder and a text encoder. Initially, the Segment Anything Model (SAM) is employed to preprocess raw images, identifying candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are subsequently fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for both the original images and the candidate bounding boxes to serve as textual inputs. These prompts are then processed by the text encoder to derive textual features. The image and text encoders collaboratively project the multimodal features into a shared semantic space, facilitating the computation of similarity scores between visual and textual representations. The final detection results are determined based on these similarity scores, ensuring a robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and 92.66% F1-score, thereby validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.
基金We acknowledge funding from NSFC Grant 62306283.
文摘Since the 1950s,when the Turing Test was introduced,there has been notable progress in machine language intelligence.Language modeling,crucial for AI development,has evolved from statistical to neural models over the last two decades.Recently,transformer-based Pre-trained Language Models(PLM)have excelled in Natural Language Processing(NLP)tasks by leveraging large-scale training corpora.Increasing the scale of these models enhances performance significantly,introducing abilities like context learning that smaller models lack.The advancement in Large Language Models,exemplified by the development of ChatGPT,has made significant impacts both academically and industrially,capturing widespread societal interest.This survey provides an overview of the development and prospects from Large Language Models(LLM)to Large Multimodal Models(LMM).It first discusses the contributions and technological advancements of LLMs in the field of natural language processing,especially in text generation and language understanding.Then,it turns to the discussion of LMMs,which integrates various data modalities such as text,images,and sound,demonstrating advanced capabilities in understanding and generating cross-modal content,paving new pathways for the adaptability and flexibility of AI systems.Finally,the survey highlights the prospects of LMMs in terms of technological development and application potential,while also pointing out challenges in data integration,cross-modal understanding accuracy,providing a comprehensive perspective on the latest developments in this field.
文摘Students are considered one of the groups most affected by psychological pro-blems.Given the highly dangerous nature of mental illnesses and the increasing-ly serious state of global mental health,it is imperative for us to explore new me-thods and approaches concerning the prevention and treatment of mental illne-sses.Large multimodal models(LMMs),as the most advanced artificial intelligen-ce models(i.e.ChatGPT-4),have brought new hope to the accurate prevention,diagnosis,and treatment of psychiatric disorders.The assistance of these models in the promotion of mental health is critical,as the latter necessitates a strong foundation of medical knowledge and professional skills,emotional support,stigma mitigation,the encouragement of more honest patient self-disclosure,reduced health care costs,improved medical efficiency,and greater mental health service coverage.However,these models must address challenges related to health,safety,hallucinations,and ethics simultaneously.In the future,we should address these challenges by developing relevant usage manuals,accountability rules,and legal regulations;implementing a human-centered approach;and intelligently upgrading LMMs through the deep optimization of such models,their algorithms,and other means.This effort will thus substantially contribute not only to the maintenance of students’health but also to the achievement of global sustainable development goals.
基金National Natural Science Foundation of China,Grant No.62273203National Key R&D Program of China,Grant No.2018YFB1307101Taishan Scholars Program of Shandong Province,Grant No.ts201511005.
文摘Active object detection(AOD)is a crucial task in the field of robotics.A key challenge in household environments for AOD is that the target object is often undetectable due to partial occlusion,which leads to the failure of traditional methods.To address the occlusion problem,this paper first proposes a novel occlusion handling method based on the large multimodal model(LMM).The method utilises an LMM to detect and analyse input RGB images and generates adjustment actions to progressively eliminate occlusion.After the occlusion is handled,an improved AOD method based on a deep Q-learning network(DQN)is used to complete the task.We introduce an attention mechanism to process image features,enabling the model to focus on critical regions of the input images.Additionally,a new reward function is proposed that comprehensively considers the bounding box of the target object and the robot's distance to the object,along with the actions performed by the robot.Ex-periments on the dataset and in real-world scenarios validate the effectiveness of the proposed method in performing AOD tasks under partial occlusion.
基金supported by a grant from the Ministry of Research,Innovation and Digitization,CNCS/CCCDI-UEFISCDI,project number COFUND-CETP-SMART-LEM-1,within PNCDI Ⅳ.
文摘Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing visual elements,and the correlation between different parts of the documents,as well as domain-specific semantics.Simply extracting text is not sufficient;advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately.This paper aims to evaluate the ability of the Large Language Models(LLMs)to correctly answer questions about various types of charts,comparing their performance when using images as input versus directly parsing PDF files.To retrieve the images from the PDF,ColPali,a model leveraging state-of-the-art visual languagemodels,is used to identify the relevant page containing the appropriate chart for each question.Google’s Gemini multimodal models were used to answer a set of questions through two approaches:1)processing images derived from PDF documents and 2)directly utilizing the content of the same PDFs.Our findings underscore the limitations of traditional OCR-based approaches in visual document understanding(VrDU)and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks.Through structured benchmarking of chart question answering(CQA)across input formats,our work contributes to the advancement of chart understanding(CU)and the broader field of multimodal document analysis.Using two diverse and information-rich sources:the World Health Statistics 2024 report by theWorld Health Organisation and the Global Banking Annual Review 2024 by McKinsey&Company,we examine the performance ofmultimodal LLMs across different input modalities,comparing their effectiveness in processing charts as images versus parsing directly from PDF content.These documents were selected due to their multimodal nature,combining dense textual analysis with varied visual representations,thus presenting realistic challenges for vision-language models.This comparison is aimed at assessing how advanced models perform with different input formats and to determine if an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.
文摘Generative Artificial Intelligence(GAI)refers to a class of AI systems capable of creating novel,coherent,and contextually relevant content—such as text,images,audio,and video—based on patterns learned from extensive training datasets.The public release and rapid refinement of large language models(LLMs)like ChatGPT have accelerated the adoption of GAI across various medical specialties,offering new tools for education,clinical simulation,and research.Dermatology training,which heavily relies on visual pattern recognition and requires extensive exposure to diverse morphological presentations,faces persistent challenges such as uneven distribu-tion of educational resources,limited patient exposure for rare conditions,and variability in teaching quality.Exploring the integration of GAI into pedagogical frameworks offers innovative approaches to address these challenges,potentially enhancing the quality,standardization,scalability,and accessibility of dermatology ed-ucation.This comprehensive review examines the core concepts and technical foundations of GAI,highlights its specific applications within dermatology teaching and learning—including simulated case generation,per-sonalized learning pathways,and academic support—and discusses the current limitations,practical challenges,and ethical considerations surrounding its use.The aim is to provide a balanced perspective on the significant potential of GAI for transforming dermatology education and to offer evidence-based insights to guide future exploration,implementation,and policy development.
基金funding partly by the National Natural Science Foundation of China under grant number 61701179.
文摘Sarcasm detection in Natural Language Processing(NLP)has become increasingly important,partic-ularly with the rise of social media and non-textual emotional expressions,such as images.Existing methods often rely on separate image and text modalities,which may not fully utilize the information available from both sources.To address this limitation,we propose a novel multimodal large model,i.e.,the PKME-MLM(Prior Knowledge and Multi-label Emotion analysis based Multimodal Large Model for sarcasm detection).The PKME-MLM aims to enhance sarcasm detection by integrating prior knowledge to extract useful textual information from images,which is then combined with text data for deeper analysis.This method improves the integration of image and text data,addressing the limitation of previous models that process these modalities separately.Additionally,we incorporate multi-label sentiment analysis,refining sentiment labels to improve sarcasm recognition accuracy.This design overcomes the limitations of prior models that treated sentiment classification as a single-label problem,thereby improving sarcasm recognition by distinguishing subtle emotional cues from the text.Experimental results demonstrate that our approach achieves significant performance improvements in multimodal sarcasm detection tasks,with an accuracy(Acc.)of 94.35%,and Macro-Average Precision and Recall reaching 93.92%and 94.21%,respectively.These results highlight the potential of multimodal models in improving sarcasm detection and suggest that further integration of modalities could advance future research.This work also paves the way for incorporating multimodal sentiment analysis into sarcasm detection.
基金Supported by China Health Promotion Foundation Young Doctors’Research Foundation for Inflammatory Bowel DiseaseTaishan Scholars Program of Shandong Province,China,NO.tsqn202306343National Natural Science Foundation of China,No.82270580,No.82070552,No.82270578,and No.82300599.
文摘BACKGROUND Gastrointestinal diseases have complex etiologies and clinical presentations.An accurate diagnosis requires physicians to integrate diverse information,including medical history,laboratory test results,and imaging findings.Existing artificial intelligence-assisted diagnostic tools are limited to single-modality information,resulting in recommendations that are often incomplete and may be associated with clinical or legal risks.AIM To develop and evaluate a collaborative multimodal large language model(LLM)framework for clinical decision-making in digestive diseases.METHODS In this observational study,DeepGut,a multimodal LLM collaborative diagnostic framework,was developed to integrate four distinct large models into a four-tiered structure.The framework sequentially accomplishes multimodal infor-mation extraction,logical“chain”construction,diagnostic and treatment suggestion generation,and risk analysis.The model was evaluated using objective metrics,which assess the reliability and comprehensiveness of model-generated results,and subjective expert opinions,which examine the effectiveness of the framework in assisting physicians.RESULTS The diagnostic and treatment recommendations generated by the DeepGut framework achieved exceptional performance,with a diagnostic accuracy of 97.8%,diagnostic completeness of 93.9%,treatment plan accuracy of 95.2%,and treatment plan completeness of 98.0%,significantly surpassing the capabilities of single-modal LLM-based diagnostic tools.Experts evaluating the framework commended the completeness,relevance,and logical coherence of its outputs.However,the collaborative multimodal LLM approach resulted in increased input and output token counts,leading to higher computational costs and extended diagnostic times.CONCLUSION The framework achieves successful integration of multimodal diagnostic data,demonstrating enhanced performance enabled by multimodal LLM collaboration,which opens new horizons for the clinical application of artificial intelligence-assisted technology.
基金The Natural Science Foundation of Hebei Province(F2024501044).
文摘The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can simultaneously process multi-modality data such as medical images and medical reports.These models can not only recognize images,but also understand the semantic relationship between images and texts,effectively realize the integration of medical information,and provide strong support for clinical decision-making and disease diagnosis.The visual-language large model has good performance for specific medical tasks,and also shows strong potential and high intelligence in the general task models.This paper provides a comprehensive review of the visual-language large model in the field of medical health.Specifically,this paper first introduces the basic theoretical basis and technical principles.Then,this paper introduces the specific application scenarios in the field of medical health,including modality fusion,semi-supervised learning,weakly supervised learning,unsupervised learning,cross-domain model and general models.Finally,the challenges including insufficient data,interpretability,and practical deployment are discussed.According to the existing challenges,four potential future development directions are given.
文摘User identity linkage(UIL)refers to identifying user accounts belonging to the same identity across different social media platforms.Most of the current research is based on text analysis,which fails to fully explore the rich image resources generated by users,and the existing attempts touch on the multimodal domain,but still face the challenge of semantic differences between text and images.Given this,we investigate the UIL task across different social media platforms based on multimodal user-generated contents(UGCs).We innovatively introduce the efficient user identity linkage via aligned multi-modal features and temporal correlation(EUIL)approach.The method first generates captions for user-posted images with the BLIP model,alleviating the problem of missing textual information.Subsequently,we extract aligned text and image features with the CLIP model,which closely aligns the two modalities and significantly reduces the semantic gap.Accordingly,we construct a set of adapter modules to integrate the multimodal features.Furthermore,we design a temporal weight assignment mechanism to incorporate the temporal dimension of user behavior.We evaluate the proposed scheme on the real-world social dataset TWIN,and the results show that our method reaches 86.39%accuracy,which demonstrates the excellence in handling multimodal data,and provides strong algorithmic support for UIL.
基金Supported by the National Natural Science Foundation of China(72088101,42372175)PetroChina Science and Technology Innovation Fund Program(2021DQ02-0904)。
文摘This article elucidates the concept of large model technology,summarizes the research status of large model technology both domestically and internationally,provides an overview of the application status of large models in vertical industries,outlines the challenges and issues confronted in applying large models in the oil and gas sector,and offers prospects for the application of large models in the oil and gas industry.The existing large models can be briefly divided into three categories:large language models,visual large models,and multimodal large models.The application of large models in the oil and gas industry is still in its infancy.Based on open-source large language models,some oil and gas enterprises have released large language model products using methods like fine-tuning and retrieval augmented generation.Scholars have attempted to develop scenario-specific models for oil and gas operations by using visual/multimodal foundation models.A few researchers have constructed pre-trained foundation models for seismic data processing and interpretation,as well as core analysis.The application of large models in the oil and gas industry faces challenges such as current data quantity and quality being difficult to support the training of large models,high research and development costs,and poor algorithm autonomy and control.The application of large models should be guided by the needs of oil and gas business,taking the application of large models as an opportunity to improve data lifecycle management,enhance data governance capabilities,promote the construction of computing power,strengthen the construction of“artificial intelligence+energy”composite teams,and boost the autonomy and control of large model technology.
基金funded by the MUREP High Volume project(80NSSC22M0132)through the U.S.NASA Office of STEM Engagementthe SMART IAC Project(DE-EE0009726)through the U.S.Department of Energy Office of Manufacturing and Energy Supply Chainssupport of San Diego Supercomputer Center(SDSC)National Research Platform(NRP)Nautilus sponsored by the U.S.NSF(2100237,2120019)。
文摘The additive manufacturing(AM)landscape has significantly transformed in alignment with Industry 4.0 principles,primarily driven by the integration of artificial intelligence(AI)and digital twins(DT).However,current intelligent AM(IAM)systems face limitations such as fragmented AI tool usage and suboptimal human-machine interaction.This paper reviews existing IAM solutions,emphasizing control,monitoring,process autonomy,and end-to-end integration,and identifies key limitations,such as the absence of a high-level controller for global decision-making.To address these gaps,we propose a transition from IAM to autonomous AM,featuring a hierarchical framework with four integrated layers:knowledge,generative solution,operational,and cognitive.In the cognitive layer,AI agents notably enable machines to independently observe,analyze,plan,and execute operations that traditionally require human intervention.These capabilities streamline production processes and expand the possibilities for innovation,particularly in sectors like in-space manufacturing.Additionally,this paper discusses the role of AI in self-optimization and lifelong learning,positing that the future of AM will be characterized by a symbiotic relationship between human expertise and advanced autonomy,fostering a more adaptive,resilient manufacturing ecosystem.
基金Project supported by the Consulting Project of the Chinese Academy of Engineering(No.2023-XY-09)the National Natural Science Foundation of China(No.62272100)the Academy-Locality Cooperation Project of the Chinese Academy of Engineering(No.JS2021ZT05)。
文摘The task of recognizing Chinese variant characters aims to address the challenges of semantic ambiguity and confusion,which potentially cause risks to the security of Web content and complicate the governance of sensitive words.Most existing approaches predominantly prioritize the acquisition of contextual knowledge from Chinese corpora and vocabularies during pretraining,often overlooking the inherent phonological and morphological characteristics of the Chinese language.To address these issues,we propose a shared-weight multimodal translation model(SMTM)based on multimodal information of Chinese characters,which integrates the phonology of Pinyin and the morphology of fonts into each Chinese character token to learn the deeper semantics of variant text.Specifically,we encode the Pinyin features of Chinese characters using the embedding layer,and the font features of Chinese characters are extracted based on convolutional neural networks directly.Considering the multimodal similarity between the source and target sentences of the Chinese variant-character-recognition task,we design the shared-weight embedding mechanism to generate target sentences using the heuristic information from the source sentences in the training process.The simulation results show that our proposed SMTM achieves remarkable performance of 89.550%and 79.480%on bilingual evaluation understudy(BLEU)and F1 metrics respectively,with significant improvement compared with state-of-the-art baseline models.
Funding: Supported by the Guangdong Provincial Science and Technology Program (Grant No. 2023A0505030003).
Abstract: Can current robotic technologies truly replicate the full scope and intricacies of human labour? In practice, the adoption of robots remains limited, especially in the open, unstructured environments commonly encountered in everyday scenarios such as services, healthcare, agriculture, construction, and numerous other fields. From the perspective of general robotic manipulation, the challenges arise from three factors. (1) High operational barriers: human operators are obliged to master specialized robotic programming languages and gain a deep understanding of the tasks at hand; these tasks need to be broken down into action-level robotic programs, which results in high labour costs. (2) Limited autonomous task execution: robots lack the capability to independently plan and execute the actions required to achieve target tasks. This limitation renders them unsuitable for deployment in open, unstructured environments that demand sophisticated interaction and seamless collaboration with humans.
Abstract: The rapid advancements in large language models (LLMs) have led to an exponential increase in survey papers, making it challenging to systematically track and analyze their evolving taxonomy. This study employs graph representation learning combined with classical machine learning techniques to model and interpret the structural evolution of LLM-related survey papers. By constructing attributed graphs that capture topic distributions and interconnections, we provide a data-driven framework to explore research trends in this domain. A dataset of 241 survey papers published between July 2021 and January 2024 is analyzed to identify thematic developments and interdisciplinary relationships. The results highlight key areas of specialization, including the emergence of prompting science, multimodal models, and domain-specific applications in finance, education, and law. Co-occurrence analysis of survey topics reveals strong interconnections between core LLM research and fields such as software engineering, hardware architecture, and evaluation methodologies. These findings demonstrate the increasing specialization of LLM research and its growing integration across multiple disciplines. By leveraging graph-based methodologies, this study offers a structured approach to understanding the LLM survey landscape, facilitating efficient navigation of existing literature and identification of emerging research directions. The insights presented contribute to a more comprehensive understanding of the field's trajectory, assisting researchers and practitioners in engaging with the latest developments in LLM research.
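To illustrate the attributed-graph construction at a toy scale, the sketch below treats papers as nodes carrying topic-distribution attributes and links papers whose distributions are similar; the six random papers, five topic labels, and 0.8 similarity threshold are invented stand-ins for the study's 241 surveys and learned topics.

# Toy attributed graph of survey papers (networkx + numpy).
import networkx as nx
import numpy as np

rng = np.random.default_rng(42)
topics = ["prompting", "multimodal", "finance", "education", "law"]
papers = {f"survey_{i}": rng.dirichlet(np.ones(len(topics))) for i in range(6)}

G = nx.Graph()
for pid, dist in papers.items():
    G.add_node(pid, topic_dist=dist, main_topic=topics[int(dist.argmax())])

# Connect papers whose topic distributions have cosine similarity > 0.8.
ids = list(papers)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        da, db = papers[a], papers[b]
        sim = float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))
        if sim > 0.8:
            G.add_edge(a, b, weight=sim)

# Co-occurrence-style summary: which dominant topics end up connected.
for a, b in G.edges:
    print(G.nodes[a]["main_topic"], "--", G.nodes[b]["main_topic"])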
Abstract: The rapid advancement of artificial intelligence has significantly impacted education, with large-scale foundation models (LFMs) emerging as transformative tools. While LFMs have demonstrated exceptional performance across diverse domains, their integration into K-12 education remains in its early stages, requiring alignment with pedagogical principles, cognitive development, and curriculum standards. This paper provides a comprehensive technological review of LFM applications in K-12 education, examining current workflows, challenges, and future opportunities. We explore how LFMs facilitate personalized learning, teacher-student collaboration, and automated assessment, while highlighting critical issues such as motivation, engagement, and age-appropriate instructional strategies. By analyzing global developments, this study offers valuable insights for educators seeking to optimize AI-driven teaching methods and for students leveraging AI for self-directed learning. Our findings aim to inform future research and drive innovation in educational AI, ensuring the effective and ethical integration of LFMs into the evolving K-12 educational landscape.
Abstract: Video-text retrieval (VTR) is an essential task in multimodal learning, aiming to bridge the semantic gap between visual and textual data. Effective video frame sampling plays a crucial role in improving retrieval performance, as it determines the quality of the visual content representation. Traditional sampling methods, such as uniform sampling and optical flow-based techniques, often fail to capture the full semantic range of videos, leading to redundancy and inefficiency. In this work, we propose CLIP4Video-Sampling, a global semantics-guided multi-granularity frame sampling strategy for video-text retrieval, designed to optimize both computational efficiency and retrieval accuracy. By integrating multi-scale global and local temporal sampling and leveraging the powerful feature extraction capabilities of the CLIP (Contrastive Language-Image Pre-training) model, our method significantly outperforms existing approaches in both zero-shot and fine-tuned video-text retrieval tasks on popular datasets. CLIP4Video-Sampling reduces redundancy, ensures keyframe coverage, and serves as an adaptable pre-processing module for multimodal models.
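The semantics-guided selection idea can be sketched independently of the paper's exact pipeline: embed frames (below, a random stand-in for CLIP's image encoder), seed the keyframe set with the frame nearest the global mean embedding, then greedily add the frame least similar to anything already chosen. The embedding stub and k = 8 are assumptions, not CLIP4Video-Sampling's actual procedure.

# Greedy max-min keyframe sampling in an embedding space (numpy).
import numpy as np

def embed_frames(n_frames: int, dim: int = 512) -> np.ndarray:
    """Stand-in for CLIP image embeddings: one unit vector per frame."""
    e = np.random.randn(n_frames, dim)
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def sample_keyframes(emb: np.ndarray, k: int = 8) -> list[int]:
    """Pick k frames that spread out over the embedding space."""
    center = emb.mean(axis=0)              # proxy for global video semantics
    selected = [int((emb @ center).argmax())]
    while len(selected) < k:
        sims = emb @ emb[selected].T       # similarity to already-chosen frames
        selected.append(int(sims.max(axis=1).argmin()))
    return sorted(selected)

print(sample_keyframes(embed_frames(120)))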
Abstract: Traditional quantitative investment research is encountering diminishing returns alongside rising labor and time costs. To overcome these challenges, we introduce the large investment model (LIM), a novel research paradigm designed to enhance both performance and efficiency at scale. LIM employs end-to-end learning and universal modeling to create an upstream foundation model, which is capable of autonomously learning comprehensive signal patterns from diverse financial data spanning multiple exchanges, instruments, and frequencies. These "global patterns" are subsequently transferred to downstream strategy modeling, optimizing performance for specific tasks. We detail the system architecture design of LIM, address the technical challenges inherent in this approach, and outline potential directions for future research.
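The upstream/downstream split can be illustrated with a deliberately tiny sketch: a shared encoder is pretrained on pooled data with a placeholder pretext loss, then frozen while a small task-specific head is fitted. Every shape, loss, and tensor here is invented; this abstract does not disclose LIM's actual architecture or objectives.

# Toy pretrain-then-transfer pattern (PyTorch); all data is random.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
x = torch.randn(256, 64)                   # pooled multi-source features

# Upstream: learn shared representations with a placeholder pretext loss.
opt = torch.optim.Adam(encoder.parameters())
for _ in range(10):
    z = encoder(x)
    loss = ((z[1:] - z[:-1]) ** 2).mean()  # placeholder smoothness objective
    opt.zero_grad(); loss.backward(); opt.step()

# Downstream: freeze the encoder and fit a small strategy-specific head.
for p in encoder.parameters():
    p.requires_grad_(False)
head = nn.Linear(32, 1)
opt2 = torch.optim.Adam(head.parameters())
y = torch.randn(256, 1)                    # mock strategy target
for _ in range(10):
    pred = head(encoder(x))
    loss = nn.functional.mse_loss(pred, y)
    opt2.zero_grad(); loss.backward(); opt2.step()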
Funding: Financial support provided by the National Natural Science Foundation of China (No. 51274202), the Fundamental Research Funds for the Central Universities (No. 2013RC11), the Science and Technology Achievements Transformation Project of Jiangsu Province (No. BA2012068), the Natural Science Foundation of Jiangsu Province (Nos. BK20130199 and BK20131124), the Ceeusro Prospective Joint Research Project of Jiangsu Province (No. BY2014028-01), and the Great Cultivating Special Project at China University of Mining and Technology (No. 2014ZDPY16).
Abstract: To combat multipath fading of electromagnetic waves in wireless communication within confined areas, a cooperative communication system for rectangular tunnels was established based on the multimode channel model, and the channel capacity formula was derived. Under the criterion of maximizing channel capacity, power allocation methods for both amplify-and-forward (AF) and decode-and-forward (DF) cooperative communication systems were proposed subject to a total power constraint. Mode selection methods for single-input single-output (SISO) and single-input multiple-output (SIMO) models in the rectangular tunnel, through which higher channel capacity can be obtained, were put forward as well. Theoretical analysis and simulation comparisons show that the channel capacity of a wireless communication system in a rectangular tunnel can be effectively enhanced through cooperative techniques; the channel capacity of the rectangular tunnel under complicated conditions is maximized through the proposed power allocation methods, and the cooperative mode with the optimal channel capacity can be chosen according to the mode selection methods given in the paper.
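For orientation, capacity analyses of half-duplex relaying typically build on the standard two-slot expressions below; these are the textbook forms, not the paper's tunnel-specific multimode derivation, with \gamma_{sd}, \gamma_{sr}, and \gamma_{rd} denoting the source-destination, source-relay, and relay-destination signal-to-noise ratios:

C_{\mathrm{AF}} = \frac{1}{2}\log_2\!\left(1 + \gamma_{sd} + \frac{\gamma_{sr}\,\gamma_{rd}}{\gamma_{sr} + \gamma_{rd} + 1}\right),
\qquad
C_{\mathrm{DF}} = \frac{1}{2}\min\!\left\{\log_2\!\left(1 + \gamma_{sr}\right),\; \log_2\!\left(1 + \gamma_{sd} + \gamma_{rd}\right)\right\}.

The factor 1/2 reflects the two time slots consumed by half-duplex relaying; power allocation under a total power constraint then maximizes these capacities over the transmit powers underlying the \gamma terms.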
Funding: Supported by the National Natural Science Foundation of China (61374140) and the Shanghai Pujiang Program (12PJ1402200).
Abstract: A novel approach named aligned mixture probabilistic principal component analysis (AMPPCA) is proposed in this study for fault detection in multimode chemical processes. To exploit within-mode correlations, the AMPPCA algorithm first estimates a statistical description for each operating mode by applying mixture probabilistic principal component analysis (MPPCA). As a comparison, a combined MPPCA is employed, in which monitoring results are softly integrated according to the posterior probabilities of the test sample in each local model. To exploit cross-mode correlations, which may be useful but are inadvertently neglected when monitoring models are built separately, a global monitoring model is constructed by aligning all local models together. In this way, both within-mode and cross-mode correlations are preserved in the integrated space. Finally, the utility and feasibility of AMPPCA are demonstrated on a non-isothermal continuous stirred-tank reactor and the TE benchmark process.
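The soft-integration step of the combined-MPPCA baseline can be sketched as follows, using a Gaussian mixture plus per-mode PCA as runnable stand-ins for MPPCA; the alignment step that distinguishes AMPPCA itself is not shown, and all data and component counts are illustrative.

# Soft combination of per-mode monitoring statistics (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic operating modes with different means.
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(4, 1, (200, 5))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
pcas = [PCA(n_components=2).fit(X[labels == k]) for k in range(2)]  # local models

def t2(pca: PCA, x: np.ndarray) -> float:
    """Hotelling-style T^2 of one sample in a local PCA model."""
    score = pca.transform(x[None, :])[0]
    return float(np.sum(score ** 2 / pca.explained_variance_))

x_new = rng.normal(4, 1, 5)                  # a test sample from mode 2
post = gmm.predict_proba(x_new[None, :])[0]  # posterior mode probabilities
combined = sum(p * t2(m, x_new) for p, m in zip(post, pcas))
print(f"soft-combined T^2: {combined:.2f}")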