Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical impo...Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical importance of this task, conventional vision-based foreign object detection methodologies have predominantly concentrated on image data, neglecting the exploration and integration of textual information. The currently popular multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to enable simultaneous understanding of both visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. This model leverages CLIP’s robust generalization capabilities to enhance performance in the context of catenary foreign object detection. The Railway-CLIP model is primarily composed of an image encoder and a text encoder. Initially, the Segment Anything Model (SAM) is employed to preprocess raw images, identifying candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are subsequently fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for both the original images and the candidate bounding boxes to serve as textual inputs. These prompts are then processed by the text encoder to derive textual features. The image and text encoders collaboratively project the multimodal features into a shared semantic space, facilitating the computation of similarity scores between visual and textual representations. The final detection results are determined based on these similarity scores, ensuring a robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and 92.66% F1-score, thereby validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.展开更多
Since the 1950s,when the Turing Test was introduced,there has been notable progress in machine language intelligence.Language modeling,crucial for AI development,has evolved from statistical to neural models over the ...Since the 1950s,when the Turing Test was introduced,there has been notable progress in machine language intelligence.Language modeling,crucial for AI development,has evolved from statistical to neural models over the last two decades.Recently,transformer-based Pre-trained Language Models(PLM)have excelled in Natural Language Processing(NLP)tasks by leveraging large-scale training corpora.Increasing the scale of these models enhances performance significantly,introducing abilities like context learning that smaller models lack.The advancement in Large Language Models,exemplified by the development of ChatGPT,has made significant impacts both academically and industrially,capturing widespread societal interest.This survey provides an overview of the development and prospects from Large Language Models(LLM)to Large Multimodal Models(LMM).It first discusses the contributions and technological advancements of LLMs in the field of natural language processing,especially in text generation and language understanding.Then,it turns to the discussion of LMMs,which integrates various data modalities such as text,images,and sound,demonstrating advanced capabilities in understanding and generating cross-modal content,paving new pathways for the adaptability and flexibility of AI systems.Finally,the survey highlights the prospects of LMMs in terms of technological development and application potential,while also pointing out challenges in data integration,cross-modal understanding accuracy,providing a comprehensive perspective on the latest developments in this field.展开更多
Students are considered one of the groups most affected by psychological pro-blems.Given the highly dangerous nature of mental illnesses and the increasing-ly serious state of global mental health,it is imperative for...Students are considered one of the groups most affected by psychological pro-blems.Given the highly dangerous nature of mental illnesses and the increasing-ly serious state of global mental health,it is imperative for us to explore new me-thods and approaches concerning the prevention and treatment of mental illne-sses.Large multimodal models(LMMs),as the most advanced artificial intelligen-ce models(i.e.ChatGPT-4),have brought new hope to the accurate prevention,diagnosis,and treatment of psychiatric disorders.The assistance of these models in the promotion of mental health is critical,as the latter necessitates a strong foundation of medical knowledge and professional skills,emotional support,stigma mitigation,the encouragement of more honest patient self-disclosure,reduced health care costs,improved medical efficiency,and greater mental health service coverage.However,these models must address challenges related to health,safety,hallucinations,and ethics simultaneously.In the future,we should address these challenges by developing relevant usage manuals,accountability rules,and legal regulations;implementing a human-centered approach;and intelligently upgrading LMMs through the deep optimization of such models,their algorithms,and other means.This effort will thus substantially contribute not only to the maintenance of students’health but also to the achievement of global sustainable development goals.展开更多
Lysosomes are essential organelles for cells that act as the“recycling center”for decomposing biomolecules and clearing out damaged organelles.The status of lysosomes is tightly regulated by cells to maintain normal...Lysosomes are essential organelles for cells that act as the“recycling center”for decomposing biomolecules and clearing out damaged organelles.The status of lysosomes is tightly regulated by cells to maintain normal homeostasis.To monitor subcellular lysosomal status,super-resolution imaging has emerged as a promising technology that surpasses conventional imaging limitations,offering extraordinary visualization capability.However,existing fluorescent probes for super-resolution imaging still suffer from significant drawbacks,such as complex synthesis,poor intracellular stability,and the lack of near-infrared(NIR)imaging capability.Besides,to quantitatively analyze fluorescence images,traditional human-driven image interpretation is time-consuming and prone to information loss and human error.To tackle these challenges,we first developed a quinoliniumbased fluorescent probe,PA-2,for NIR super-resolution imaging of lysosomes with low cytotoxicity and stable fluorescence.Harnessing PA-2’s strong resistance to photobleaching,the lysosomal dynamic statuses,encompassing autophagy,mitochondrialysosome contacts,and mitophagy,were successfully visualized.Building on this,we next demonstrate a novel approach leveraging a large multimodal model(LMM),an advanced artificial intelligence(AI)tool,for automated analysis of super-resolution images.The LMM accurately interprets images of PA-2 and predicts lysosomal status under various drug treatments with remarkable speed,precision,and explainability,significantly outperforming human experts in image analysis.To sum up,this work highlights the strong potential of combining advanced fluorescent probe design with AI-assisted image interpretation to drive revolutionary innovation in bioimaging and beyond.展开更多
Active object detection(AOD)is a crucial task in the field of robotics.A key challenge in household environments for AOD is that the target object is often undetectable due to partial occlusion,which leads to the fail...Active object detection(AOD)is a crucial task in the field of robotics.A key challenge in household environments for AOD is that the target object is often undetectable due to partial occlusion,which leads to the failure of traditional methods.To address the occlusion problem,this paper first proposes a novel occlusion handling method based on the large multimodal model(LMM).The method utilises an LMM to detect and analyse input RGB images and generates adjustment actions to progressively eliminate occlusion.After the occlusion is handled,an improved AOD method based on a deep Q-learning network(DQN)is used to complete the task.We introduce an attention mechanism to process image features,enabling the model to focus on critical regions of the input images.Additionally,a new reward function is proposed that comprehensively considers the bounding box of the target object and the robot's distance to the object,along with the actions performed by the robot.Ex-periments on the dataset and in real-world scenarios validate the effectiveness of the proposed method in performing AOD tasks under partial occlusion.展开更多
Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing...Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing visual elements,and the correlation between different parts of the documents,as well as domain-specific semantics.Simply extracting text is not sufficient;advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately.This paper aims to evaluate the ability of the Large Language Models(LLMs)to correctly answer questions about various types of charts,comparing their performance when using images as input versus directly parsing PDF files.To retrieve the images from the PDF,ColPali,a model leveraging state-of-the-art visual languagemodels,is used to identify the relevant page containing the appropriate chart for each question.Google’s Gemini multimodal models were used to answer a set of questions through two approaches:1)processing images derived from PDF documents and 2)directly utilizing the content of the same PDFs.Our findings underscore the limitations of traditional OCR-based approaches in visual document understanding(VrDU)and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks.Through structured benchmarking of chart question answering(CQA)across input formats,our work contributes to the advancement of chart understanding(CU)and the broader field of multimodal document analysis.Using two diverse and information-rich sources:the World Health Statistics 2024 report by theWorld Health Organisation and the Global Banking Annual Review 2024 by McKinsey&Company,we examine the performance ofmultimodal LLMs across different input modalities,comparing their effectiveness in processing charts as images versus parsing directly from PDF content.These documents were selected due to their multimodal nature,combining dense textual analysis with varied visual representations,thus presenting realistic challenges for vision-language models.This comparison is aimed at assessing how advanced models perform with different input formats and to determine if an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.展开更多
Generative Artificial Intelligence(GAI)refers to a class of AI systems capable of creating novel,coherent,and contextually relevant content—such as text,images,audio,and video—based on patterns learned from extensiv...Generative Artificial Intelligence(GAI)refers to a class of AI systems capable of creating novel,coherent,and contextually relevant content—such as text,images,audio,and video—based on patterns learned from extensive training datasets.The public release and rapid refinement of large language models(LLMs)like ChatGPT have accelerated the adoption of GAI across various medical specialties,offering new tools for education,clinical simulation,and research.Dermatology training,which heavily relies on visual pattern recognition and requires extensive exposure to diverse morphological presentations,faces persistent challenges such as uneven distribu-tion of educational resources,limited patient exposure for rare conditions,and variability in teaching quality.Exploring the integration of GAI into pedagogical frameworks offers innovative approaches to address these challenges,potentially enhancing the quality,standardization,scalability,and accessibility of dermatology ed-ucation.This comprehensive review examines the core concepts and technical foundations of GAI,highlights its specific applications within dermatology teaching and learning—including simulated case generation,per-sonalized learning pathways,and academic support—and discusses the current limitations,practical challenges,and ethical considerations surrounding its use.The aim is to provide a balanced perspective on the significant potential of GAI for transforming dermatology education and to offer evidence-based insights to guide future exploration,implementation,and policy development.展开更多
Sarcasm detection in Natural Language Processing(NLP)has become increasingly important,partic-ularly with the rise of social media and non-textual emotional expressions,such as images.Existing methods often rely on se...Sarcasm detection in Natural Language Processing(NLP)has become increasingly important,partic-ularly with the rise of social media and non-textual emotional expressions,such as images.Existing methods often rely on separate image and text modalities,which may not fully utilize the information available from both sources.To address this limitation,we propose a novel multimodal large model,i.e.,the PKME-MLM(Prior Knowledge and Multi-label Emotion analysis based Multimodal Large Model for sarcasm detection).The PKME-MLM aims to enhance sarcasm detection by integrating prior knowledge to extract useful textual information from images,which is then combined with text data for deeper analysis.This method improves the integration of image and text data,addressing the limitation of previous models that process these modalities separately.Additionally,we incorporate multi-label sentiment analysis,refining sentiment labels to improve sarcasm recognition accuracy.This design overcomes the limitations of prior models that treated sentiment classification as a single-label problem,thereby improving sarcasm recognition by distinguishing subtle emotional cues from the text.Experimental results demonstrate that our approach achieves significant performance improvements in multimodal sarcasm detection tasks,with an accuracy(Acc.)of 94.35%,and Macro-Average Precision and Recall reaching 93.92%and 94.21%,respectively.These results highlight the potential of multimodal models in improving sarcasm detection and suggest that further integration of modalities could advance future research.This work also paves the way for incorporating multimodal sentiment analysis into sarcasm detection.展开更多
BACKGROUND Gastrointestinal diseases have complex etiologies and clinical presentations.An accurate diagnosis requires physicians to integrate diverse information,including medical history,laboratory test results,and ...BACKGROUND Gastrointestinal diseases have complex etiologies and clinical presentations.An accurate diagnosis requires physicians to integrate diverse information,including medical history,laboratory test results,and imaging findings.Existing artificial intelligence-assisted diagnostic tools are limited to single-modality information,resulting in recommendations that are often incomplete and may be associated with clinical or legal risks.AIM To develop and evaluate a collaborative multimodal large language model(LLM)framework for clinical decision-making in digestive diseases.METHODS In this observational study,DeepGut,a multimodal LLM collaborative diagnostic framework,was developed to integrate four distinct large models into a four-tiered structure.The framework sequentially accomplishes multimodal infor-mation extraction,logical“chain”construction,diagnostic and treatment suggestion generation,and risk analysis.The model was evaluated using objective metrics,which assess the reliability and comprehensiveness of model-generated results,and subjective expert opinions,which examine the effectiveness of the framework in assisting physicians.RESULTS The diagnostic and treatment recommendations generated by the DeepGut framework achieved exceptional performance,with a diagnostic accuracy of 97.8%,diagnostic completeness of 93.9%,treatment plan accuracy of 95.2%,and treatment plan completeness of 98.0%,significantly surpassing the capabilities of single-modal LLM-based diagnostic tools.Experts evaluating the framework commended the completeness,relevance,and logical coherence of its outputs.However,the collaborative multimodal LLM approach resulted in increased input and output token counts,leading to higher computational costs and extended diagnostic times.CONCLUSION The framework achieves successful integration of multimodal diagnostic data,demonstrating enhanced performance enabled by multimodal LLM collaboration,which opens new horizons for the clinical application of artificial intelligence-assisted technology.展开更多
The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can si...The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can simultaneously process multi-modality data such as medical images and medical reports.These models can not only recognize images,but also understand the semantic relationship between images and texts,effectively realize the integration of medical information,and provide strong support for clinical decision-making and disease diagnosis.The visual-language large model has good performance for specific medical tasks,and also shows strong potential and high intelligence in the general task models.This paper provides a comprehensive review of the visual-language large model in the field of medical health.Specifically,this paper first introduces the basic theoretical basis and technical principles.Then,this paper introduces the specific application scenarios in the field of medical health,including modality fusion,semi-supervised learning,weakly supervised learning,unsupervised learning,cross-domain model and general models.Finally,the challenges including insufficient data,interpretability,and practical deployment are discussed.According to the existing challenges,four potential future development directions are given.展开更多
Gastrointestinal(GI)cancers represent a major global health concern due to their high incidence and mortality rates.Foundation models(FMs),also referred to as large models,represent a novel class of artificial intelli...Gastrointestinal(GI)cancers represent a major global health concern due to their high incidence and mortality rates.Foundation models(FMs),also referred to as large models,represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges.These models encompass large language models(LLMs),vision FMs(VFMs),and multimodal LLMs(MLLMs),all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization.This review delineates the principal applications of these models:LLMs facilitate the structuring of clinical narratives,extraction of insights from medical records,and enhancement of physician-patient communication;VFMs are employed in the analysis of endoscopic,radiological,and pathological images for lesion detection and staging;MLLMs integrate heterogeneous data modalities,including imaging,textual information,and genomic data,to support diagnostic processes,treatment prediction,and prognostic evaluation.Despite these promising developments,several challenges remain,such as the need for data standardization,limited diversity within training datasets,substantial computational resource requirements,and ethical-legal concerns.In conclusion,FMs exhibit significant potential to advance research and clinical management of GI cancers.Future research efforts should prioritize the refinement of these models,promote international collaborations,and adopt interdisciplinary approaches.Such a comprehensive strategy is essential to fully harness the capabilities of FMs,driving substantial progress in the fight against GI malignancies.展开更多
Large language models(LLMs)have emerged as transformative tools in radiology artificial intelligence(AI),offering significant capabilities in areas such as image report generation,clinical decision support,and workflo...Large language models(LLMs)have emerged as transformative tools in radiology artificial intelligence(AI),offering significant capabilities in areas such as image report generation,clinical decision support,and workflow optimization.The first part of this manuscript presents a comprehensive overview of the current state of LLM applications in radiology,including their historical evolution,technical foundations,and practical uses.Despite notable advances,inherent architectural constraints,such as token-level sequential processing,limit their ability to perform deep abstract reasoning and holistic contextual understanding,which are critical for fine-grained diagnostic interpretation.We provide a critical perspective on current LLMs and discuss key challenges,including model reliability,bias,and explainability,highlighting the pressing need for novel approaches to advance radiology AI.Large concept models(LCMs)represent a nascent and promising paradigm in radiology AI,designed to transcend the limitations of token-level processing by utilizing higher-order conceptual representations and multimodal data integration.The second part of this manuscript introduces the foundational principles and theoretical framework of LCMs,highlighting their potential to facilitate enhanced semantic reasoning,long-range context synthesis,and improved clinical decision-making.Critically,the core of this section is the proposal of a novel theoretical framework for LCMs,formalized and extended from our group’s foundational concept-based models-the world’s earliest articulation of this paradigm for medical AI.This conceptual shift has since been externally validated and propelled by the recent publication of the LCM architectural proposal by Meta AI,providing a large-scale engineering blueprint for the future development of this technology.We also outline future research directions and the transformative implications of this emerging AI paradigm for radiologic practice,aiming to provide a blueprint for advancing toward human-like conceptual understanding in AI.While challenges persist,we are at the very beginning of a new era,and it is not unreasonable to hope that future advancements will overcome these hurdles,pushing the boundaries of AI in Radiology,far beyond even the most state-of-the-art models of today.展开更多
User identity linkage(UIL)refers to identifying user accounts belonging to the same identity across different social media platforms.Most of the current research is based on text analysis,which fails to fully explore ...User identity linkage(UIL)refers to identifying user accounts belonging to the same identity across different social media platforms.Most of the current research is based on text analysis,which fails to fully explore the rich image resources generated by users,and the existing attempts touch on the multimodal domain,but still face the challenge of semantic differences between text and images.Given this,we investigate the UIL task across different social media platforms based on multimodal user-generated contents(UGCs).We innovatively introduce the efficient user identity linkage via aligned multi-modal features and temporal correlation(EUIL)approach.The method first generates captions for user-posted images with the BLIP model,alleviating the problem of missing textual information.Subsequently,we extract aligned text and image features with the CLIP model,which closely aligns the two modalities and significantly reduces the semantic gap.Accordingly,we construct a set of adapter modules to integrate the multimodal features.Furthermore,we design a temporal weight assignment mechanism to incorporate the temporal dimension of user behavior.We evaluate the proposed scheme on the real-world social dataset TWIN,and the results show that our method reaches 86.39%accuracy,which demonstrates the excellence in handling multimodal data,and provides strong algorithmic support for UIL.展开更多
Due to the digital transformation tendency among cultural institutions and the substantial influence of the social media platform,the demands of visual communication keep increasing for promoting traditional cultural ...Due to the digital transformation tendency among cultural institutions and the substantial influence of the social media platform,the demands of visual communication keep increasing for promoting traditional cultural artifacts online.As an effective medium,posters serve to attract public attention and facilitate broader engagement with cultural artifacts.However,existing poster generation methods mainly rely on fixed templates and manual design,which limits their scalability and adaptability to the diverse visual and semantic features of the artifacts.Therefore,we propose CAPGen,an automated aesthetic Cultural Artifacts Poster Generation framework built on a Multimodal Large Language Model(MLLM)with integrated iterative optimization.During our research,we collaborated with designers to define principles of graphic design for cultural artifact posters,to guide the MLLM in generating layout parameters.Later,we generated these parameters into posters.Finally,we refined the posters using an MLLM integrated with a multi-round iterative optimization mechanism.Qualitative results show that CAPGen consistently outperforms baseline methods in both visual quality and aesthetic performance.Furthermore,ablation studies indicate that the prompt,iterative optimization mechanism,and design principles significantly enhance the effectiveness of poster generation.展开更多
This article elucidates the concept of large model technology,summarizes the research status of large model technology both domestically and internationally,provides an overview of the application status of large mode...This article elucidates the concept of large model technology,summarizes the research status of large model technology both domestically and internationally,provides an overview of the application status of large models in vertical industries,outlines the challenges and issues confronted in applying large models in the oil and gas sector,and offers prospects for the application of large models in the oil and gas industry.The existing large models can be briefly divided into three categories:large language models,visual large models,and multimodal large models.The application of large models in the oil and gas industry is still in its infancy.Based on open-source large language models,some oil and gas enterprises have released large language model products using methods like fine-tuning and retrieval augmented generation.Scholars have attempted to develop scenario-specific models for oil and gas operations by using visual/multimodal foundation models.A few researchers have constructed pre-trained foundation models for seismic data processing and interpretation,as well as core analysis.The application of large models in the oil and gas industry faces challenges such as current data quantity and quality being difficult to support the training of large models,high research and development costs,and poor algorithm autonomy and control.The application of large models should be guided by the needs of oil and gas business,taking the application of large models as an opportunity to improve data lifecycle management,enhance data governance capabilities,promote the construction of computing power,strengthen the construction of“artificial intelligence+energy”composite teams,and boost the autonomy and control of large model technology.展开更多
The rapid advancement of large models has led to the development of increasingly sophisticated models capable of generating diverse,personalized,and high-quality content.Among these,DeepSeek has emerged as a pivotal o...The rapid advancement of large models has led to the development of increasingly sophisticated models capable of generating diverse,personalized,and high-quality content.Among these,DeepSeek has emerged as a pivotal open-source initiative,demonstrating high performance at significantly lower computation costs compared to closed-source counterparts.This survey provides a comprehensive overview of the DeepSeek family of models,including DeepSeek-V3 and DeepSeek-R1,covering their core innovations in architecture,system pipeline,algorithm,and infrastructure.We explore their practical applications across various domains,such as healthcare,finance,and education,highlighting their impact on both industry and society.Further-more,we examine potential security,privacy,and ethical concerns arising from the widespread deployment of these models,emphasizing the need for responsible AI development.Finally,we outline future research directions to enhance the performance,safety,and scalability of DeepSeek models,aiming to foster further advancements in the open-source large model community.展开更多
The additive manufacturing(AM)landscape has significantly transformed in alignment with Industry 4.0 principles,primarily driven by the integration of artificial intelligence(AI)and digital twins(DT).However,current i...The additive manufacturing(AM)landscape has significantly transformed in alignment with Industry 4.0 principles,primarily driven by the integration of artificial intelligence(AI)and digital twins(DT).However,current intelligent AM(IAM)systems face limitations such as fragmented AI tool usage and suboptimal human-machine interaction.This paper reviews existing IAM solutions,emphasizing control,monitoring,process autonomy,and end-to-end integration,and identifies key limitations,such as the absence of a high-level controller for global decision-making.To address these gaps,we propose a transition from IAM to autonomous AM,featuring a hierarchical framework with four integrated layers:knowledge,generative solution,operational,and cognitive.In the cognitive layer,AI agents notably enable machines to independently observe,analyze,plan,and execute operations that traditionally require human intervention.These capabilities streamline production processes and expand the possibilities for innovation,particularly in sectors like in-space manufacturing.Additionally,this paper discusses the role of AI in self-optimization and lifelong learning,positing that the future of AM will be characterized by a symbiotic relationship between human expertise and advanced autonomy,fostering a more adaptive,resilient manufacturing ecosystem.展开更多
The task of recognizing Chinese variant characters aims to address the challenges of semantic ambiguity and confusion,which potentially cause risks to the security of Web content and complicate the governance of sensi...The task of recognizing Chinese variant characters aims to address the challenges of semantic ambiguity and confusion,which potentially cause risks to the security of Web content and complicate the governance of sensitive words.Most existing approaches predominantly prioritize the acquisition of contextual knowledge from Chinese corpora and vocabularies during pretraining,often overlooking the inherent phonological and morphological characteristics of the Chinese language.To address these issues,we propose a shared-weight multimodal translation model(SMTM)based on multimodal information of Chinese characters,which integrates the phonology of Pinyin and the morphology of fonts into each Chinese character token to learn the deeper semantics of variant text.Specifically,we encode the Pinyin features of Chinese characters using the embedding layer,and the font features of Chinese characters are extracted based on convolutional neural networks directly.Considering the multimodal similarity between the source and target sentences of the Chinese variant-character-recognition task,we design the shared-weight embedding mechanism to generate target sentences using the heuristic information from the source sentences in the training process.The simulation results show that our proposed SMTM achieves remarkable performance of 89.550%and 79.480%on bilingual evaluation understudy(BLEU)and F1 metrics respectively,with significant improvement compared with state-of-the-art baseline models.展开更多
Can current robotic technologies truly replicate the full scope and intricacies of human labour?In practice,the adoption of robots remains limited,especially in open,unstructured environments commonly encountered in e...Can current robotic technologies truly replicate the full scope and intricacies of human labour?In practice,the adoption of robots remains limited,especially in open,unstructured environments commonly encountered in everyday scenarios such as services,healthcare,agriculture,construction,and numerous other fields.From the perspective of general robotic manipulation,the challenges arise from three factors.(1)High operational barriers:human operators are obliged to master specialized robotic programming languages and gain a deep understanding of the tasks at hand.These tasks need to be broken down into action-level robotic programs,which results in high labour costs.(2)Limited autonomous task execution:robots lack the capability to independently plan and execute actions required to achieve the target tasks.This limitation renders them unsuitable for deployment in open,unstructured environments that demand sophisticated interaction and seamless collaboration with humans.展开更多
The rapid advancement of artificial intelligence(AI)has ushered in a new era of medical multimodal large language models(MLLMs),which integrate diverse data modalities such as text,imaging,physiological signals,and ge...The rapid advancement of artificial intelligence(AI)has ushered in a new era of medical multimodal large language models(MLLMs),which integrate diverse data modalities such as text,imaging,physiological signals,and genomics to enhance clinical decision-making.This systematic review explores the core methodologies and applied research frontiers of medical MLLMs,focusing on their architecture,training methods,evaluation techniques,and applications.We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment,medical knowledge integration,and robust clinical reasoning.Despite their promise,challenges such as data heterogeneity,hallucination,and computational efficiency persist.By reviewing state-of-the-art solutions and future directions,this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs,ultimately aiming to bridge the gap between AI and clinical practice.展开更多
基金supported by the Technology Research and Development Program of China National Railway Group(Q2024T002)the Open Project Fund of National Engineering Research Center of Digital Construction and Evaluation Technology of Urban Rail Transit(2024023).
文摘Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical importance of this task, conventional vision-based foreign object detection methodologies have predominantly concentrated on image data, neglecting the exploration and integration of textual information. The currently popular multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to enable simultaneous understanding of both visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. This model leverages CLIP’s robust generalization capabilities to enhance performance in the context of catenary foreign object detection. The Railway-CLIP model is primarily composed of an image encoder and a text encoder. Initially, the Segment Anything Model (SAM) is employed to preprocess raw images, identifying candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are subsequently fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for both the original images and the candidate bounding boxes to serve as textual inputs. These prompts are then processed by the text encoder to derive textual features. The image and text encoders collaboratively project the multimodal features into a shared semantic space, facilitating the computation of similarity scores between visual and textual representations. The final detection results are determined based on these similarity scores, ensuring a robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and 92.66% F1-score, thereby validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.
基金We acknowledge funding from NSFC Grant 62306283.
文摘Since the 1950s,when the Turing Test was introduced,there has been notable progress in machine language intelligence.Language modeling,crucial for AI development,has evolved from statistical to neural models over the last two decades.Recently,transformer-based Pre-trained Language Models(PLM)have excelled in Natural Language Processing(NLP)tasks by leveraging large-scale training corpora.Increasing the scale of these models enhances performance significantly,introducing abilities like context learning that smaller models lack.The advancement in Large Language Models,exemplified by the development of ChatGPT,has made significant impacts both academically and industrially,capturing widespread societal interest.This survey provides an overview of the development and prospects from Large Language Models(LLM)to Large Multimodal Models(LMM).It first discusses the contributions and technological advancements of LLMs in the field of natural language processing,especially in text generation and language understanding.Then,it turns to the discussion of LMMs,which integrates various data modalities such as text,images,and sound,demonstrating advanced capabilities in understanding and generating cross-modal content,paving new pathways for the adaptability and flexibility of AI systems.Finally,the survey highlights the prospects of LMMs in terms of technological development and application potential,while also pointing out challenges in data integration,cross-modal understanding accuracy,providing a comprehensive perspective on the latest developments in this field.
文摘Students are considered one of the groups most affected by psychological pro-blems.Given the highly dangerous nature of mental illnesses and the increasing-ly serious state of global mental health,it is imperative for us to explore new me-thods and approaches concerning the prevention and treatment of mental illne-sses.Large multimodal models(LMMs),as the most advanced artificial intelligen-ce models(i.e.ChatGPT-4),have brought new hope to the accurate prevention,diagnosis,and treatment of psychiatric disorders.The assistance of these models in the promotion of mental health is critical,as the latter necessitates a strong foundation of medical knowledge and professional skills,emotional support,stigma mitigation,the encouragement of more honest patient self-disclosure,reduced health care costs,improved medical efficiency,and greater mental health service coverage.However,these models must address challenges related to health,safety,hallucinations,and ethics simultaneously.In the future,we should address these challenges by developing relevant usage manuals,accountability rules,and legal regulations;implementing a human-centered approach;and intelligently upgrading LMMs through the deep optimization of such models,their algorithms,and other means.This effort will thus substantially contribute not only to the maintenance of students’health but also to the achievement of global sustainable development goals.
基金support of the US National Science Foundation(CHE-2453603)supported by the National Institutes of Health(NIH R35GM128837).
文摘Lysosomes are essential organelles for cells that act as the“recycling center”for decomposing biomolecules and clearing out damaged organelles.The status of lysosomes is tightly regulated by cells to maintain normal homeostasis.To monitor subcellular lysosomal status,super-resolution imaging has emerged as a promising technology that surpasses conventional imaging limitations,offering extraordinary visualization capability.However,existing fluorescent probes for super-resolution imaging still suffer from significant drawbacks,such as complex synthesis,poor intracellular stability,and the lack of near-infrared(NIR)imaging capability.Besides,to quantitatively analyze fluorescence images,traditional human-driven image interpretation is time-consuming and prone to information loss and human error.To tackle these challenges,we first developed a quinoliniumbased fluorescent probe,PA-2,for NIR super-resolution imaging of lysosomes with low cytotoxicity and stable fluorescence.Harnessing PA-2’s strong resistance to photobleaching,the lysosomal dynamic statuses,encompassing autophagy,mitochondrialysosome contacts,and mitophagy,were successfully visualized.Building on this,we next demonstrate a novel approach leveraging a large multimodal model(LMM),an advanced artificial intelligence(AI)tool,for automated analysis of super-resolution images.The LMM accurately interprets images of PA-2 and predicts lysosomal status under various drug treatments with remarkable speed,precision,and explainability,significantly outperforming human experts in image analysis.To sum up,this work highlights the strong potential of combining advanced fluorescent probe design with AI-assisted image interpretation to drive revolutionary innovation in bioimaging and beyond.
基金National Natural Science Foundation of China,Grant No.62273203National Key R&D Program of China,Grant No.2018YFB1307101Taishan Scholars Program of Shandong Province,Grant No.ts201511005.
文摘Active object detection(AOD)is a crucial task in the field of robotics.A key challenge in household environments for AOD is that the target object is often undetectable due to partial occlusion,which leads to the failure of traditional methods.To address the occlusion problem,this paper first proposes a novel occlusion handling method based on the large multimodal model(LMM).The method utilises an LMM to detect and analyse input RGB images and generates adjustment actions to progressively eliminate occlusion.After the occlusion is handled,an improved AOD method based on a deep Q-learning network(DQN)is used to complete the task.We introduce an attention mechanism to process image features,enabling the model to focus on critical regions of the input images.Additionally,a new reward function is proposed that comprehensively considers the bounding box of the target object and the robot's distance to the object,along with the actions performed by the robot.Ex-periments on the dataset and in real-world scenarios validate the effectiveness of the proposed method in performing AOD tasks under partial occlusion.
基金supported by a grant from the Ministry of Research,Innovation and Digitization,CNCS/CCCDI-UEFISCDI,project number COFUND-CETP-SMART-LEM-1,within PNCDI Ⅳ.
文摘Extracting data from visually rich documents and charts using traditional methods that rely on OCR-based parsing poses multiple challenges,including layout complexity in unstructured formats,limitations in recognizing visual elements,and the correlation between different parts of the documents,as well as domain-specific semantics.Simply extracting text is not sufficient;advanced reasoning capabilities are proving to be essential to analyze content and answer questions accurately.This paper aims to evaluate the ability of the Large Language Models(LLMs)to correctly answer questions about various types of charts,comparing their performance when using images as input versus directly parsing PDF files.To retrieve the images from the PDF,ColPali,a model leveraging state-of-the-art visual languagemodels,is used to identify the relevant page containing the appropriate chart for each question.Google’s Gemini multimodal models were used to answer a set of questions through two approaches:1)processing images derived from PDF documents and 2)directly utilizing the content of the same PDFs.Our findings underscore the limitations of traditional OCR-based approaches in visual document understanding(VrDU)and demonstrate the advantages of multimodal methods in both data extraction and reasoning tasks.Through structured benchmarking of chart question answering(CQA)across input formats,our work contributes to the advancement of chart understanding(CU)and the broader field of multimodal document analysis.Using two diverse and information-rich sources:the World Health Statistics 2024 report by theWorld Health Organisation and the Global Banking Annual Review 2024 by McKinsey&Company,we examine the performance ofmultimodal LLMs across different input modalities,comparing their effectiveness in processing charts as images versus parsing directly from PDF content.These documents were selected due to their multimodal nature,combining dense textual analysis with varied visual representations,thus presenting realistic challenges for vision-language models.This comparison is aimed at assessing how advanced models perform with different input formats and to determine if an image-based approach enhances chart comprehension in terms of accurate data extraction and reasoning capabilities.
文摘Generative Artificial Intelligence(GAI)refers to a class of AI systems capable of creating novel,coherent,and contextually relevant content—such as text,images,audio,and video—based on patterns learned from extensive training datasets.The public release and rapid refinement of large language models(LLMs)like ChatGPT have accelerated the adoption of GAI across various medical specialties,offering new tools for education,clinical simulation,and research.Dermatology training,which heavily relies on visual pattern recognition and requires extensive exposure to diverse morphological presentations,faces persistent challenges such as uneven distribu-tion of educational resources,limited patient exposure for rare conditions,and variability in teaching quality.Exploring the integration of GAI into pedagogical frameworks offers innovative approaches to address these challenges,potentially enhancing the quality,standardization,scalability,and accessibility of dermatology ed-ucation.This comprehensive review examines the core concepts and technical foundations of GAI,highlights its specific applications within dermatology teaching and learning—including simulated case generation,per-sonalized learning pathways,and academic support—and discusses the current limitations,practical challenges,and ethical considerations surrounding its use.The aim is to provide a balanced perspective on the significant potential of GAI for transforming dermatology education and to offer evidence-based insights to guide future exploration,implementation,and policy development.
基金funding partly by the National Natural Science Foundation of China under grant number 61701179.
文摘Sarcasm detection in Natural Language Processing(NLP)has become increasingly important,partic-ularly with the rise of social media and non-textual emotional expressions,such as images.Existing methods often rely on separate image and text modalities,which may not fully utilize the information available from both sources.To address this limitation,we propose a novel multimodal large model,i.e.,the PKME-MLM(Prior Knowledge and Multi-label Emotion analysis based Multimodal Large Model for sarcasm detection).The PKME-MLM aims to enhance sarcasm detection by integrating prior knowledge to extract useful textual information from images,which is then combined with text data for deeper analysis.This method improves the integration of image and text data,addressing the limitation of previous models that process these modalities separately.Additionally,we incorporate multi-label sentiment analysis,refining sentiment labels to improve sarcasm recognition accuracy.This design overcomes the limitations of prior models that treated sentiment classification as a single-label problem,thereby improving sarcasm recognition by distinguishing subtle emotional cues from the text.Experimental results demonstrate that our approach achieves significant performance improvements in multimodal sarcasm detection tasks,with an accuracy(Acc.)of 94.35%,and Macro-Average Precision and Recall reaching 93.92%and 94.21%,respectively.These results highlight the potential of multimodal models in improving sarcasm detection and suggest that further integration of modalities could advance future research.This work also paves the way for incorporating multimodal sentiment analysis into sarcasm detection.
基金Supported by China Health Promotion Foundation Young Doctors’Research Foundation for Inflammatory Bowel DiseaseTaishan Scholars Program of Shandong Province,China,NO.tsqn202306343National Natural Science Foundation of China,No.82270580,No.82070552,No.82270578,and No.82300599.
文摘BACKGROUND Gastrointestinal diseases have complex etiologies and clinical presentations.An accurate diagnosis requires physicians to integrate diverse information,including medical history,laboratory test results,and imaging findings.Existing artificial intelligence-assisted diagnostic tools are limited to single-modality information,resulting in recommendations that are often incomplete and may be associated with clinical or legal risks.AIM To develop and evaluate a collaborative multimodal large language model(LLM)framework for clinical decision-making in digestive diseases.METHODS In this observational study,DeepGut,a multimodal LLM collaborative diagnostic framework,was developed to integrate four distinct large models into a four-tiered structure.The framework sequentially accomplishes multimodal infor-mation extraction,logical“chain”construction,diagnostic and treatment suggestion generation,and risk analysis.The model was evaluated using objective metrics,which assess the reliability and comprehensiveness of model-generated results,and subjective expert opinions,which examine the effectiveness of the framework in assisting physicians.RESULTS The diagnostic and treatment recommendations generated by the DeepGut framework achieved exceptional performance,with a diagnostic accuracy of 97.8%,diagnostic completeness of 93.9%,treatment plan accuracy of 95.2%,and treatment plan completeness of 98.0%,significantly surpassing the capabilities of single-modal LLM-based diagnostic tools.Experts evaluating the framework commended the completeness,relevance,and logical coherence of its outputs.However,the collaborative multimodal LLM approach resulted in increased input and output token counts,leading to higher computational costs and extended diagnostic times.CONCLUSION The framework achieves successful integration of multimodal diagnostic data,demonstrating enhanced performance enabled by multimodal LLM collaboration,which opens new horizons for the clinical application of artificial intelligence-assisted technology.
基金The Natural Science Foundation of Hebei Province(F2024501044).
文摘The application of visual-language large models in the field of medical health has gradually become a research focus.The models combine the capability for image understanding and natural language processing,and can simultaneously process multi-modality data such as medical images and medical reports.These models can not only recognize images,but also understand the semantic relationship between images and texts,effectively realize the integration of medical information,and provide strong support for clinical decision-making and disease diagnosis.The visual-language large model has good performance for specific medical tasks,and also shows strong potential and high intelligence in the general task models.This paper provides a comprehensive review of the visual-language large model in the field of medical health.Specifically,this paper first introduces the basic theoretical basis and technical principles.Then,this paper introduces the specific application scenarios in the field of medical health,including modality fusion,semi-supervised learning,weakly supervised learning,unsupervised learning,cross-domain model and general models.Finally,the challenges including insufficient data,interpretability,and practical deployment are discussed.According to the existing challenges,four potential future development directions are given.
基金Supported by the Open Project Program of Panxi Crops Research and Utilization Key Laboratory of Sichuan Province,No.SZKF202302the Fundamental Research Funds for the Central Universities No.2019CDYGYB024.
文摘Gastrointestinal(GI)cancers represent a major global health concern due to their high incidence and mortality rates.Foundation models(FMs),also referred to as large models,represent a novel class of artificial intelligence technologies that have demonstrated considerable potential in addressing these challenges.These models encompass large language models(LLMs),vision FMs(VFMs),and multimodal LLMs(MLLMs),all of which utilize transformer architectures and self-supervised pre-training on extensive unlabeled datasets to achieve robust cross-domain generalization.This review delineates the principal applications of these models:LLMs facilitate the structuring of clinical narratives,extraction of insights from medical records,and enhancement of physician-patient communication;VFMs are employed in the analysis of endoscopic,radiological,and pathological images for lesion detection and staging;MLLMs integrate heterogeneous data modalities,including imaging,textual information,and genomic data,to support diagnostic processes,treatment prediction,and prognostic evaluation.Despite these promising developments,several challenges remain,such as the need for data standardization,limited diversity within training datasets,substantial computational resource requirements,and ethical-legal concerns.In conclusion,FMs exhibit significant potential to advance research and clinical management of GI cancers.Future research efforts should prioritize the refinement of these models,promote international collaborations,and adopt interdisciplinary approaches.Such a comprehensive strategy is essential to fully harness the capabilities of FMs,driving substantial progress in the fight against GI malignancies.
文摘Large language models(LLMs)have emerged as transformative tools in radiology artificial intelligence(AI),offering significant capabilities in areas such as image report generation,clinical decision support,and workflow optimization.The first part of this manuscript presents a comprehensive overview of the current state of LLM applications in radiology,including their historical evolution,technical foundations,and practical uses.Despite notable advances,inherent architectural constraints,such as token-level sequential processing,limit their ability to perform deep abstract reasoning and holistic contextual understanding,which are critical for fine-grained diagnostic interpretation.We provide a critical perspective on current LLMs and discuss key challenges,including model reliability,bias,and explainability,highlighting the pressing need for novel approaches to advance radiology AI.Large concept models(LCMs)represent a nascent and promising paradigm in radiology AI,designed to transcend the limitations of token-level processing by utilizing higher-order conceptual representations and multimodal data integration.The second part of this manuscript introduces the foundational principles and theoretical framework of LCMs,highlighting their potential to facilitate enhanced semantic reasoning,long-range context synthesis,and improved clinical decision-making.Critically,the core of this section is the proposal of a novel theoretical framework for LCMs,formalized and extended from our group’s foundational concept-based models-the world’s earliest articulation of this paradigm for medical AI.This conceptual shift has since been externally validated and propelled by the recent publication of the LCM architectural proposal by Meta AI,providing a large-scale engineering blueprint for the future development of this technology.We also outline future research directions and the transformative implications of this emerging AI paradigm for radiologic practice,aiming to provide a blueprint for advancing toward human-like conceptual understanding in AI.While challenges persist,we are at the very beginning of a new era,and it is not unreasonable to hope that future advancements will overcome these hurdles,pushing the boundaries of AI in Radiology,far beyond even the most state-of-the-art models of today.
文摘User identity linkage(UIL)refers to identifying user accounts belonging to the same identity across different social media platforms.Most of the current research is based on text analysis,which fails to fully explore the rich image resources generated by users,and the existing attempts touch on the multimodal domain,but still face the challenge of semantic differences between text and images.Given this,we investigate the UIL task across different social media platforms based on multimodal user-generated contents(UGCs).We innovatively introduce the efficient user identity linkage via aligned multi-modal features and temporal correlation(EUIL)approach.The method first generates captions for user-posted images with the BLIP model,alleviating the problem of missing textual information.Subsequently,we extract aligned text and image features with the CLIP model,which closely aligns the two modalities and significantly reduces the semantic gap.Accordingly,we construct a set of adapter modules to integrate the multimodal features.Furthermore,we design a temporal weight assignment mechanism to incorporate the temporal dimension of user behavior.We evaluate the proposed scheme on the real-world social dataset TWIN,and the results show that our method reaches 86.39%accuracy,which demonstrates the excellence in handling multimodal data,and provides strong algorithmic support for UIL.
基金supported by the National Key Research and Development Program of China(2023YFF0906502)the Postgraduate Research and Innovation Project of Hunan Province under Grant(CX20240473).
文摘Due to the digital transformation tendency among cultural institutions and the substantial influence of the social media platform,the demands of visual communication keep increasing for promoting traditional cultural artifacts online.As an effective medium,posters serve to attract public attention and facilitate broader engagement with cultural artifacts.However,existing poster generation methods mainly rely on fixed templates and manual design,which limits their scalability and adaptability to the diverse visual and semantic features of the artifacts.Therefore,we propose CAPGen,an automated aesthetic Cultural Artifacts Poster Generation framework built on a Multimodal Large Language Model(MLLM)with integrated iterative optimization.During our research,we collaborated with designers to define principles of graphic design for cultural artifact posters,to guide the MLLM in generating layout parameters.Later,we generated these parameters into posters.Finally,we refined the posters using an MLLM integrated with a multi-round iterative optimization mechanism.Qualitative results show that CAPGen consistently outperforms baseline methods in both visual quality and aesthetic performance.Furthermore,ablation studies indicate that the prompt,iterative optimization mechanism,and design principles significantly enhance the effectiveness of poster generation.
基金Supported by the National Natural Science Foundation of China(72088101,42372175)PetroChina Science and Technology Innovation Fund Program(2021DQ02-0904)。
文摘This article elucidates the concept of large model technology,summarizes the research status of large model technology both domestically and internationally,provides an overview of the application status of large models in vertical industries,outlines the challenges and issues confronted in applying large models in the oil and gas sector,and offers prospects for the application of large models in the oil and gas industry.The existing large models can be briefly divided into three categories:large language models,visual large models,and multimodal large models.The application of large models in the oil and gas industry is still in its infancy.Based on open-source large language models,some oil and gas enterprises have released large language model products using methods like fine-tuning and retrieval augmented generation.Scholars have attempted to develop scenario-specific models for oil and gas operations by using visual/multimodal foundation models.A few researchers have constructed pre-trained foundation models for seismic data processing and interpretation,as well as core analysis.The application of large models in the oil and gas industry faces challenges such as current data quantity and quality being difficult to support the training of large models,high research and development costs,and poor algorithm autonomy and control.The application of large models should be guided by the needs of oil and gas business,taking the application of large models as an opportunity to improve data lifecycle management,enhance data governance capabilities,promote the construction of computing power,strengthen the construction of“artificial intelligence+energy”composite teams,and boost the autonomy and control of large model technology.
文摘The rapid advancement of large models has led to the development of increasingly sophisticated models capable of generating diverse,personalized,and high-quality content.Among these,DeepSeek has emerged as a pivotal open-source initiative,demonstrating high performance at significantly lower computation costs compared to closed-source counterparts.This survey provides a comprehensive overview of the DeepSeek family of models,including DeepSeek-V3 and DeepSeek-R1,covering their core innovations in architecture,system pipeline,algorithm,and infrastructure.We explore their practical applications across various domains,such as healthcare,finance,and education,highlighting their impact on both industry and society.Further-more,we examine potential security,privacy,and ethical concerns arising from the widespread deployment of these models,emphasizing the need for responsible AI development.Finally,we outline future research directions to enhance the performance,safety,and scalability of DeepSeek models,aiming to foster further advancements in the open-source large model community.
基金funded by the MUREP High Volume project(80NSSC22M0132)through the U.S.NASA Office of STEM Engagementthe SMART IAC Project(DE-EE0009726)through the U.S.Department of Energy Office of Manufacturing and Energy Supply Chainssupport of San Diego Supercomputer Center(SDSC)National Research Platform(NRP)Nautilus sponsored by the U.S.NSF(2100237,2120019)。
文摘The additive manufacturing(AM)landscape has significantly transformed in alignment with Industry 4.0 principles,primarily driven by the integration of artificial intelligence(AI)and digital twins(DT).However,current intelligent AM(IAM)systems face limitations such as fragmented AI tool usage and suboptimal human-machine interaction.This paper reviews existing IAM solutions,emphasizing control,monitoring,process autonomy,and end-to-end integration,and identifies key limitations,such as the absence of a high-level controller for global decision-making.To address these gaps,we propose a transition from IAM to autonomous AM,featuring a hierarchical framework with four integrated layers:knowledge,generative solution,operational,and cognitive.In the cognitive layer,AI agents notably enable machines to independently observe,analyze,plan,and execute operations that traditionally require human intervention.These capabilities streamline production processes and expand the possibilities for innovation,particularly in sectors like in-space manufacturing.Additionally,this paper discusses the role of AI in self-optimization and lifelong learning,positing that the future of AM will be characterized by a symbiotic relationship between human expertise and advanced autonomy,fostering a more adaptive,resilient manufacturing ecosystem.
基金Project supported by the Consulting Project of the Chinese Academy of Engineering(No.2023-XY-09)the National Natural Science Foundation of China(No.62272100)the Academy-Locality Cooperation Project of the Chinese Academy of Engineering(No.JS2021ZT05)。
文摘The task of recognizing Chinese variant characters aims to address the challenges of semantic ambiguity and confusion,which potentially cause risks to the security of Web content and complicate the governance of sensitive words.Most existing approaches predominantly prioritize the acquisition of contextual knowledge from Chinese corpora and vocabularies during pretraining,often overlooking the inherent phonological and morphological characteristics of the Chinese language.To address these issues,we propose a shared-weight multimodal translation model(SMTM)based on multimodal information of Chinese characters,which integrates the phonology of Pinyin and the morphology of fonts into each Chinese character token to learn the deeper semantics of variant text.Specifically,we encode the Pinyin features of Chinese characters using the embedding layer,and the font features of Chinese characters are extracted based on convolutional neural networks directly.Considering the multimodal similarity between the source and target sentences of the Chinese variant-character-recognition task,we design the shared-weight embedding mechanism to generate target sentences using the heuristic information from the source sentences in the training process.The simulation results show that our proposed SMTM achieves remarkable performance of 89.550%and 79.480%on bilingual evaluation understudy(BLEU)and F1 metrics respectively,with significant improvement compared with state-of-the-art baseline models.
基金supported by the Guangdong Provincial Science and Technology Program(Grant No.2023A0505030003).
文摘Can current robotic technologies truly replicate the full scope and intricacies of human labour?In practice,the adoption of robots remains limited,especially in open,unstructured environments commonly encountered in everyday scenarios such as services,healthcare,agriculture,construction,and numerous other fields.From the perspective of general robotic manipulation,the challenges arise from three factors.(1)High operational barriers:human operators are obliged to master specialized robotic programming languages and gain a deep understanding of the tasks at hand.These tasks need to be broken down into action-level robotic programs,which results in high labour costs.(2)Limited autonomous task execution:robots lack the capability to independently plan and execute actions required to achieve the target tasks.This limitation renders them unsuitable for deployment in open,unstructured environments that demand sophisticated interaction and seamless collaboration with humans.
基金supported by the National Natural Science Foundation of China(Grant No.:62172458).
文摘The rapid advancement of artificial intelligence(AI)has ushered in a new era of medical multimodal large language models(MLLMs),which integrate diverse data modalities such as text,imaging,physiological signals,and genomics to enhance clinical decision-making.This systematic review explores the core methodologies and applied research frontiers of medical MLLMs,focusing on their architecture,training methods,evaluation techniques,and applications.We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment,medical knowledge integration,and robust clinical reasoning.Despite their promise,challenges such as data heterogeneity,hallucination,and computational efficiency persist.By reviewing state-of-the-art solutions and future directions,this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs,ultimately aiming to bridge the gap between AI and clinical practice.