Funding: Supported by the National Natural Science Foundation of China (No. 62404111), the Natural Science Foundation of Jiangsu Province (No. BK20240635), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 24KJB510025), the Natural Science Research Start-up Foundation for Recruiting Talents of Nanjing University of Posts and Telecommunications (Nos. NY223157 and NY223156), and the Opening Project of the Advanced Integrated Circuit Package and Testing Research Center of Jiangsu Province (No. NTIKFJJ202303).
Abstract: Multimodal sensor fusion can make full use of the advantages of various sensors, compensate for the shortcomings of any single sensor, achieve information verification and information security through redundancy, and improve the reliability and safety of a system. Artificial intelligence (AI), the simulation of human intelligence in machines programmed to think and learn like humans, represents a pivotal frontier in modern scientific research. With the continuous development and adoption of AI technology in the Sensor 4.0 age, multimodal sensor fusion is becoming increasingly intelligent and automated, and is expected to advance further in the future. In this context, this review takes a comprehensive look at recent progress on AI-enhanced multimodal sensors and their integrated devices and systems. Starting from the concepts and principles of sensor technologies and AI algorithms, the theoretical underpinnings, technological breakthroughs, and practical applications of AI-enhanced multimodal sensors in fields such as robotics, healthcare, and environmental monitoring are highlighted. Through a comparative study of dual/tri-modal sensors with and without AI technologies (especially machine learning and deep learning), AI-enhanced multimodal sensors demonstrate the potential of AI to improve sensor performance, data processing, and decision-making capabilities. Finally, the review analyzes the challenges and opportunities afforded by AI-enhanced multimodal sensors and offers a prospective outlook on forthcoming advancements.
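The redundancy argument above can be made concrete with a toy sketch: two noisy sensors measuring the same quantity are cross-checked for consistency and then fused by inverse-variance weighting, yielding an estimate more reliable than either reading alone. This is a minimal illustration of the general principle, not a method from the review; the noise figures and the 3-sigma consistency test are assumptions.

```python
import numpy as np

def fuse_readings(x1, var1, x2, var2, k=3.0):
    """Fuse two noisy readings of the same quantity; flag disagreement."""
    # Verification via redundancy: readings that differ by much more than
    # their combined noise suggest a faulty sensor, not a valid estimate.
    if abs(x1 - x2) > k * np.sqrt(var1 + var2):
        raise ValueError("readings disagree: verification failed")
    w1, w2 = 1.0 / var1, 1.0 / var2
    fused = (w1 * x1 + w2 * x2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)  # always below either input variance
    return fused, fused_var

# Two temperature sensors with different (assumed) noise levels.
print(fuse_readings(20.1, 0.04, 19.9, 0.09))
```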
Abstract: Visual question answering (VQA) is a multimodal task that involves a deep understanding of the image scene and the question's meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the multimodal bilinear pooling fusion technique in the trade-off between model complexity and overall model performance. Some fusion techniques can significantly increase model complexity, which seriously limits their applicability to VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce model complexity and improve model performance for this class of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques improve the VQA model's performance, reaching a best performance of 89.25%. Further, experiments show that the number of answers in the developed VQA system is a critical factor affecting the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique shows the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
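To make the bilinear-pooling family concrete, the sketch below implements a low-rank bilinear fusion layer in the spirit of Multimodal Factorized Bilinear pooling (MFB), one well-known member of this family; it is an illustrative sketch, not the paper's MLPB implementation. The 2048-d image feature matches ResNet-152's pooled output, while the question dimension, factor dimension, and rank k are assumptions.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """MFB-style fusion: approximates a full bilinear interaction cheaply."""
    def __init__(self, img_dim=2048, q_dim=512, factor_dim=1024, k=5):
        super().__init__()
        self.k = k                      # factors sum-pooled per output unit
        self.out_dim = factor_dim
        self.proj_img = nn.Linear(img_dim, factor_dim * k)
        self.proj_q = nn.Linear(q_dim, factor_dim * k)

    def forward(self, img_feat, q_feat):
        # Element-wise product of two projections stands in for the
        # (parameter-heavy) full outer-product bilinear interaction.
        joint = self.proj_img(img_feat) * self.proj_q(q_feat)
        # Sum-pool every k factors down to one output unit.
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)
        # Signed square-root and L2 normalisation, as is common practice.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return nn.functional.normalize(joint, dim=-1)

fusion = LowRankBilinearFusion()
img = torch.randn(4, 2048)   # e.g., ResNet-152 pooled image features
q = torch.randn(4, 512)      # e.g., GRU question encoding (assumed size)
print(fusion(img, q).shape)  # torch.Size([4, 1024])
```

The design trade-off the paper studies is visible here: the factor dimension and rank k directly control parameter count and hence model complexity.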
Abstract: With growing urban areas, the climate continues to change as a result of growing populations, and hence the demand for better emergency response systems has become more important than ever. Human Behaviour Classification (HBC) systems have started to play a vital role by analysing data from different sources to detect signs of emergencies. These systems are being used in many critical areas, such as healthcare, public safety, and disaster management, to improve response time and to prepare ahead of time. But detecting human behaviour in such stressful conditions is not simple; it often comes with noisy data, missing information, and the need to react in real time. This review takes a deeper look at HBC research published between 2020 and 2025 and aims to answer five specific research questions. These questions cover the types of emergencies discussed in the literature, the datasets and sensors used, the effectiveness of machine learning (ML) and deep learning (DL) models, and the limitations that still exist in this field. We explored 120 papers that used different types of datasets: some were based on sensor data, others on social media, and a few used hybrid approaches. Commonly used models included CNNs, LSTMs, and reinforcement learning methods to identify behaviours. Though a lot of progress has been made, the review found ongoing issues in combining sensors properly, reacting fast enough, and using more diverse datasets. Overall, from the findings we observed, the focus should be on building systems that use multiple sensors together, gather real-time data on a large scale, and produce results that are easier to interpret. Privacy and ethical concerns also need to be properly addressed.
Abstract: Against the backdrop of active global responses to climate change and the accelerating green, low-carbon energy transition, the co-optimization and innovative mechanism design of multimodal energy systems have become a significant instrument for propelling the energy revolution and ensuring energy security. Under increasingly stringent carbon emission constraints, achieving multi-dimensional improvements in energy utilization efficiency, renewable energy accommodation, and system economics, through the intelligent coupling of diverse energy carriers such as electricity, heat, natural gas, and hydrogen, and the effective application of market-based instruments like carbon trading and demand response, constitutes a critical scientific and engineering challenge demanding urgent solutions.
Funding: Supported by the 135 High-end Talent Project of West China Hospital, Sichuan University, No. ZYDG23029.
Abstract: BACKGROUND: Recent advancements in artificial intelligence (AI) have significantly enhanced the capabilities of endoscopy-assisted diagnosis of gastrointestinal diseases. AI has shown great promise in clinical practice, particularly for diagnostic support, offering real-time insights into complex conditions such as esophageal squamous cell carcinoma. CASE SUMMARY: In this study, we introduce a multimodal AI system that successfully identified and delineated a small, flat carcinoma during esophagogastroduodenoscopy, highlighting its potential for early detection of malignancies. The lesion was confirmed as high-grade squamous intraepithelial neoplasia, with pathology results supporting the AI system's accuracy. The multimodal AI system offers an integrated solution that provides real-time, accurate diagnostic information directly within the endoscopic device interface, allowing single-monitor use without disrupting the endoscopist's workflow. CONCLUSION: This work underscores the transformative potential of AI to enhance endoscopic diagnosis by enabling earlier, more accurate interventions.
Funding: The authors acknowledge the financial support of the National Natural Science Foundation of China (No. 52173028).
Abstract: Since the first design of tactile sensors was proposed by Harmon in 1982, tactile sensors have evolved through four key phases: industrial applications (1980s, basic pressure detection), miniaturization via MEMS (1990s), flexible electronics (2010s, stretchable materials), and intelligent systems (2020s-present, AI-driven multimodal sensing). With innovations in materials, processing techniques, and the multimodal fusion of stimuli, the applications of tactile sensors have continuously expanded into diverse areas, including but not limited to medical care, aerospace, sports, and intelligent robots. Currently, researchers are dedicated to developing tactile sensors with emerging mechanisms and structures, pursuing high sensitivity, high resolution, and multimodal characteristics, and further constructing tactile systems that imitate and approach the performance of human organs. However, significant challenges remain in connecting theoretical research with practical applications, and there is a lack of comprehensive understanding of the state of the art in transferring such knowledge from academic work to technical products. Scaled-up production of laboratory materials faces critical challenges such as high costs, limited scale, and inconsistent quality. Ambient factors, such as temperature, humidity, and electromagnetic interference, also impair signal reliability. Moreover, tactile sensors must operate across a wide pressure range (0.1 kPa to several or even dozens of MPa) to meet diverse application needs. Meanwhile, existing algorithms, data models, and sensing systems commonly exhibit insufficient precision and poor robustness in data processing, and a practical gap remains between the designed and the required system response speed. In this review, oriented by the design requirements of intelligent tactile sensing systems, we summarize the common sensing mechanisms, bioinspired structures, key performance metrics, and optimization strategies, followed by a brief overview of recent advances in system integration and algorithm implementation and a possible roadmap for the future development of tactile sensors, providing forward-looking as well as critical discussion of the future industrial applications of flexible tactile sensors.
Abstract: Hepatocellular carcinoma presents with three distinct immune phenotypes, immune-desert, immune-excluded, and immune-inflamed, indicating different treatment responses and prognostic outcomes. Although multi-omics parameters accurately reflect immune status, their clinical application is still restricted by expensive and less accessible assays. A comprehensive evaluation framework based on "easy-to-obtain" multimodal clinical parameters is therefore urgently required, incorporating: clinical features to establish baseline patient profiles and disease staging; routine blood tests to assess systemic metabolic and functional status; immune cell subsets to quantify subcluster dynamics; imaging features to delineate tumor morphology, spatial configuration, and perilesional anatomical relationships; and immunohistochemical markers for qualitative and quantitative detection of tumor antigens at the cellular and molecular levels. This integrated phenomic approach aims to improve prognostic stratification and clinical decision-making in hepatocellular carcinoma management conveniently and practically.
Funding: Funded by Fundación CajaCanarias and Fundación Bancaria "la Caixa", grant number 2023DIG11.
Abstract: Business Process Modelling (BPM) is essential for analyzing, improving, and automating the flow of information within organizations, but traditional approaches based on manual interpretation are slow, error-prone, and require a high level of expertise. This article proposes an innovative alternative that overcomes these limitations by automatically generating comprehensive Business Process Model and Notation (BPMN) diagrams solely from verbal descriptions of the processes to be modeled, utilizing Large Language Models (LLMs) and multimodal Artificial Intelligence (AI). Experimental results, based on video recordings of process explanations provided by an expert from an organization (in this case, the Commercial Courts of a public justice administration), demonstrate that the proposed methodology successfully enables the automatic generation of complete and accurate BPMN diagrams, leading to significant improvements in the speed, accuracy, and accessibility of process modeling. This research makes a substantial contribution to the field of business process modeling: its methodology is groundbreaking in its use of LLM and multimodal AI capabilities to handle different types of source material (text and video), combining several tools to minimize the number of queries and reduce the complexity of the prompts required for the automatic generation of successful BPMN diagrams.
Funding: The authors extend their appreciation to Prince Sattam bin Abdulaziz University for funding this research work through project number PSAU/2024/01/32082.
Abstract: In Human-Robot Interaction (HRI), generating robot trajectories that accurately reflect user intentions while ensuring physical realism remains challenging, especially in unstructured environments. In this study, we develop a multimodal framework that integrates symbolic task reasoning with continuous trajectory generation. The approach employs transformer models and adversarial training to map high-level intent to robotic motion. Information from multiple data sources, such as voice traits, hand and body keypoints, visual observations, and recorded paths, is integrated simultaneously. These signals are mapped into a shared representation that supports interpretable reasoning while enabling smooth and realistic motion generation. Based on this design, two learning strategies are investigated. In the first, grammar-constrained Linear Temporal Logic (LTL) expressions are created from multimodal human inputs and subsequently decoded into robot trajectories. The second generates trajectories directly from symbolic intent and linguistic data, bypassing an intermediate logical representation. Transformer encoders combine the multiple types of information, and autoregressive transformer decoders generate motion sequences. Adding smoothness and speed limits during training increases the likelihood of physical feasibility. To improve the realism and stability of the generated trajectories, an adversarial discriminator is also included to guide them toward the distribution of actual robot motion. Tests on the NATSGLD dataset indicate that the complete system exhibits stable training behaviour and performance. In normalised coordinates, the logic-based pipeline achieves an Average Displacement Error (ADE) of 0.040 and a Final Displacement Error (FDE) of 0.036. The adversarial generator improves substantially on this, reducing ADE to 0.021 and FDE to 0.018. Visual examination confirms that the generated trajectories closely align with observed motion patterns while preserving smooth temporal dynamics.
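For readers unfamiliar with the reported metrics, ADE is the mean per-step Euclidean error between predicted and ground-truth trajectories, and FDE is the error at the final step. The sketch below shows the standard computation; the trajectory shapes and values are synthetic stand-ins, not NATSGLD data.

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 2) arrays of normalised trajectory coordinates."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-step Euclidean error
    return dists.mean(), dists[-1]              # ADE, FDE

# Synthetic ground truth and a slightly perturbed prediction.
gt = np.cumsum(np.random.randn(50, 2) * 0.01, axis=0)
pred = gt + np.random.randn(50, 2) * 0.005
ade, fde = ade_fde(pred, gt)
print(f"ADE={ade:.3f}, FDE={fde:.3f}")
```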
Abstract: The problem of fake news detection (FND) is becoming increasingly important in natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4 excel at natural language understanding but can still struggle to distinguish fact from fiction, particularly when applied in the wild. A key challenge for existing FND methods is that they consider only unimodal data (e.g., images), while richer multimodal data (e.g., user behaviour, temporal dynamics), which is crucial for full-context understanding, is neglected. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, strengthened by a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updates and adapt the classifier threshold based on the latest multimodal input, enabling the model to better adapt to evolving misinformation strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both textual and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
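One simple way the score fusion and adaptive threshold described above could be realised is sketched below: three modality scores are combined linearly, and the decision threshold is nudged whenever feedback reports an error. This is a hedged illustration of the general idea only; the weights and update rule are assumptions, not M3-FND's contextual-reinforcement-learning scheduler.

```python
def fused_score(align, cred, temporal, w=(0.5, 0.3, 0.2)):
    """Combine image-text alignment, user-credibility, and temporal scores."""
    return w[0] * align + w[1] * cred + w[2] * temporal

class AdaptiveThreshold:
    """Shift the decision boundary against the direction of observed errors."""
    def __init__(self, init=0.5, rate=0.05):
        self.t, self.rate = init, rate

    def update(self, score, was_error):
        if was_error and score >= self.t:
            self.t += self.rate  # false positive: make positives harder
        elif was_error:
            self.t -= self.rate  # false negative: make positives easier
        return self.t

th = AdaptiveThreshold()
s = fused_score(0.8, 0.4, 0.6)
print(s >= th.t, th.update(s, was_error=True))  # decision, adjusted threshold
```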
Funding: Supported by the National Natural Science Foundation of China (Nos. 22208218, 22078196, and 22278268), the Natural Science Foundation of Shanghai (No. 22ZR1460400), the Collaborative Innovation Center of Fragrance Flavour and Cosmetics, and the Collaborative Innovation Project of Shanghai Institute of Technology (No. XTCX2023-07).
Abstract: The diagnostic efficacy of contemporary bioimaging technologies remains constrained by the inherent limitations of conventional imaging agents, including suboptimal sensitivity, off-target biodistribution, and inherent cytotoxicity. These limitations have catalyzed the development of intelligent stimuli-responsive block-copolymer-based bioimaging agents, which are engineered to respond dynamically to endogenous biochemical cues (e.g., pH gradients, redox potential, enzyme activity, hypoxic environments) or exogenous physical triggers (e.g., photoirradiation, thermal gradients, ultrasound (US)/magnetic stimuli). Through spatiotemporally controlled structural transformations, stimuli-responsive block copolymers enable precise contrast targeting, activatable signal amplification, and theranostic integration, thereby substantially enhancing the signal-to-noise ratio of bioimaging and diagnostic specificity. Hence, this mini-review systematically examines molecular engineering principles for designing pH-, redox-, enzyme-, light-, thermo-, and US/magnetic-responsive polymers, with emphasis on the structure-property relationships governing imaging performance. Furthermore, we critically analyze emerging strategies for optical imaging, US synergies, and magnetic resonance imaging (MRI). Multimodal bioimaging is also elaborated, as it could overcome the inherent trade-offs between resolution, penetration depth, and functional specificity in single-modal approaches. By elucidating mechanistic insights and translational challenges, this mini-review aims to establish a design framework for stimuli-responsive block-copolymer-based high-fidelity bioimaging agents and to accelerate their clinical translation in precise diagnosis and therapy.
Funding: Supported by Xuhui District Health Commission, No. SHXH202214.
Abstract: Gastrointestinal tumors require personalized treatment strategies due to their heterogeneity and complexity. Multimodal artificial intelligence (AI) addresses this challenge by integrating diverse data sources, including computed tomography (CT), magnetic resonance imaging (MRI), endoscopic imaging, and genomic profiles, to enable intelligent decision-making for individualized therapy. This approach leverages AI algorithms to fuse imaging, endoscopic, and omics data, facilitating comprehensive characterization of tumor biology, prediction of treatment response, and optimization of therapeutic strategies. By combining CT and MRI for structural assessment, endoscopic data for real-time visual inspection, and genomic information for molecular profiling, multimodal AI enhances the accuracy of patient stratification and treatment personalization. The clinical implementation of this technology demonstrates potential for improving patient outcomes, advancing precision oncology, and supporting individualized care in gastrointestinal cancers. Ultimately, multimodal AI serves as a transformative tool in oncology, bridging data integration with clinical application to tailor therapies effectively.
Funding: Supported by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. IITP-2026-2020-0-01741) and by the research fund of Hanyang University (HY-2025-1110).
Abstract: Arrhythmias are a frequently occurring phenomenon in clinical practice, but accurately distinguishing subtle rhythm abnormalities remains an ongoing difficulty for the research community in ECG-based studies. A review of existing work suggests two main contributing factors: the uneven distribution of arrhythmia classes and the limited expressiveness of the features learned by current models. To overcome these limitations, this study proposes a dual-path multimodal framework, termed DM-EHC (Dual-Path Multimodal ECG Heartbeat Classifier), for ECG-based heartbeat classification. The proposed framework links 1D ECG temporal features with 2D time-frequency features; by setting up these dual paths, the model can exploit more dimensions of feature information. The MIT-BIH arrhythmia database was selected as the baseline dataset for the experiments. Experimental results show that the proposed method outperforms single-modality baselines and performs better on certain specific types of arrhythmias. The model achieved mean precision, recall, and F1 score of 95.14%, 92.26%, and 93.65%, respectively. These results indicate that the framework is robust and has potential value for automated arrhythmia classification.
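The sketch below prepares the two modalities such a dual-path classifier consumes: the raw 1D beat segment and a 2D time-frequency image of the same beat, here obtained with an STFT magnitude spectrogram as one common choice. The actual transform used by DM-EHC may differ, and the segment length and STFT parameters are assumptions; only the MIT-BIH sampling rate of 360 Hz is a known fact.

```python
import numpy as np
from scipy.signal import stft

fs = 360                        # MIT-BIH sampling rate (Hz)
beat_1d = np.random.randn(fs)   # stand-in for a 1-second ECG beat segment

# 2D path: log-compressed magnitude spectrogram of the same segment.
f, t, Zxx = stft(beat_1d, fs=fs, nperseg=64, noverlap=48)
beat_2d = np.log1p(np.abs(Zxx))

# A dual-path model would feed beat_1d to a 1D branch (e.g., Conv1d)
# and beat_2d to a 2D branch (e.g., Conv2d), then fuse the embeddings.
print(beat_1d.shape, beat_2d.shape)
```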
Funding: Supported by the DH2025-TN07-07 project conducted at the Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam, with additional support from the AI in Software Engineering Lab.
Abstract: It remains difficult to automate the creation and validation of Unified Modeling Language (UML) diagrams due to unstructured requirements, limited automated pipelines, and the lack of reliable evaluation methods. This study introduces a cohesive architecture that combines requirement development, UML synthesis, and multimodal validation. First, LLaMA-3.2-1B-Instruct is used to generate user-focused requirements. Then, DeepSeek-R1-Distill-Qwen-32B applies its reasoning skills to transform these requirements into PlantUML code. Using this dual-LLM pipeline, we constructed a synthetic dataset of 11,997 UML diagrams spanning six major diagram families. Rendering analysis showed that 89.5% of the generated diagrams compile correctly, while invalid cases were detected automatically. To assess quality, we employed a multimodal scoring method that combines Qwen2.5-VL-3B, LLaMA-3.2-11B-Vision-Instruct, and Aya-Vision-8B, with weights based on MMMU performance. A study with 94 experts revealed strong alignment between automatic and manual evaluations, yielding a Pearson correlation of r = 0.82 and a Fleiss' kappa of 0.78, indicating a high degree of concordance between automated metrics and human judgment. Overall, the results demonstrate that our scoring system is effective and that the proposed generation pipeline produces UML diagrams that are both syntactically correct and semantically coherent. More broadly, the system provides a scalable and reproducible foundation for future work in AI-driven software modeling and multimodal verification.
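The benchmark-weighted ensemble scoring can be sketched in a few lines: each vision-language judge rates a rendered diagram, and its vote is weighted by its benchmark performance. The MMMU figures and per-diagram scores below are illustrative placeholders, not values reported by the study.

```python
# Hypothetical MMMU accuracies used only to derive relative weights.
mmmu_perf = {
    "Qwen2.5-VL-3B": 0.47,
    "LLaMA-3.2-11B-Vision-Instruct": 0.50,
    "Aya-Vision-8B": 0.41,
}
total = sum(mmmu_perf.values())
weights = {model: perf / total for model, perf in mmmu_perf.items()}

def aggregate(scores):
    """scores: each judge's quality rating for one rendered UML diagram."""
    return sum(weights[m] * s for m, s in scores.items())

print(aggregate({
    "Qwen2.5-VL-3B": 4.0,
    "LLaMA-3.2-11B-Vision-Instruct": 3.5,
    "Aya-Vision-8B": 4.5,
}))  # weighted quality score for the diagram
```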
Funding: Supported by the National Natural Science Foundation of China (62473082, 82202250, 82121003, 62036003, and 62333003), the Fundamental Research Funds for the Central Universities (ZYGX2022YGRH008 and ZYGX2024XJ054), and the Medical-Engineering Cooperation Funds from the University of Electronic Science and Technology of China (ZYGX2021YGLH201).
Abstract: The brain atlas, or parcellation, delineates spatial partitions that organize the brain's structure and function [1]. The spatial arrangements of these highly heterogeneous landscapes represent specialized functional regions whose interactions can then be investigated. Early efforts to parcellate the mammalian brain, using histological cytoarchitecture and myeloarchitecture as well as recent in vivo magnetic resonance imaging (MRI) [2,3], have primarily involved cortical areas, subcortical structures, and cerebellar nuclei. Human brain parcellations focus primarily on grey matter (GM) and purposefully exclude white matter (WM), hindering the development of next-generation brain atlases.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 32270688, 31801117, and 82430107 to X.L., and 32500589 to H.S.), the China Postdoctoral Science Foundation (Grant Nos. BX20240253 and 2024M762384 to H.S.), the Natural Science Foundation of Tianjin (Grant No. 24JCQNJC01280 to H.S.), and the Tianjin Key Medical Discipline (Specialty) Construction Project (Grant No. TJYXZDXK-3-003A).
Abstract: For decades, the central dogma of oncology has been that a cancer's identity is inextricably linked to its anatomical origin. This principle underpins the entire diagnostic and therapeutic framework, from histology-based classification to site-specific treatment guidelines. Yet this framework fails catastrophically for a substantial population of patients diagnosed with cancer of unknown primary (CUP). These patients present with metastatic disease, yet their primary tumors remain elusive despite exhaustive clinical workup [1]. CUP, accounting for 1%-3% of all cancer diagnoses, is an enigma with devastating consequences; the median overall survival is only 2-12 months [2-4]. The inability to pinpoint an origin forces clinicians to rely on broad-spectrum empirical chemotherapy, such as taxane-carboplatin regimens, which have limited efficacy and exclude patients from the promise of targeted therapies and clinical trials [5]. CUP is not only a diagnostic challenge but also an indictment of the siloed approach to understanding malignancy: this cancer highlights the limitations of origin-based diagnostic frameworks. However, the confluence of high-dimensional biological data and advanced artificial intelligence (AI) is now poised to address this long-standing diagnostic limitation and to herald a new era not only for CUP but for oncology as a whole (Figure 1).
Abstract: High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
Funding: Supported by the National Key R&D Program of China (2023YFA1406200), the National Natural Science Foundation of China (T2521005, 12174144, 12474009, 12174146, and 124B2059), and the Special Construction Project Fund for Shandong Province Taishan Scholars.
Abstract: Multifunctional optically responsive materials have grown increasingly pivotal in addressing the escalating demands of sensing, detection, and anti-counterfeiting applications [1,2]. These materials exhibit distinct visible optical variations upon exposure to external stimuli such as pressure, temperature, light, solvents, pH fluctuations, or mechanical force. Fluorescent sensing and anti-counterfeiting technologies leveraging these optical responses have emerged as highly promising solutions.
Funding: Supported in part by the National Natural Science Foundation of China under Grants 52475102 and 52205101, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515240021, in part by the Young Talent Support Project of Guangzhou Association for Science and Technology (QT-2024-28), and in part by the Youth Development Initiative of Guangdong Association for Science and Technology (SKXRC2025254).
Abstract: To ensure the safe and stable operation of rotating machinery, intelligent fault diagnosis methods hold significant research value. However, existing diagnostic approaches largely rely on manual feature extraction and expert experience, which limits their adaptability under variable operating conditions and strong noise, severely affecting the generalization capability of diagnostic models. To address this issue, this study proposes a multimodal fusion fault diagnosis framework based on Mel-spectrograms and automated machine learning (AutoML). The framework first extracts fault-sensitive Mel time-frequency features from acoustic signals and fuses them with statistical features of vibration signals to construct complementary fault representations. On this basis, automated machine learning techniques are introduced to enable end-to-end construction of the diagnostic workflow and acquisition of an optimal model configuration. Finally, diagnostic decisions are made by automatically integrating the predictions of multiple high-performance base models. Experimental results on a centrifugal pump vibration and acoustic dataset demonstrate that the proposed framework achieves high diagnostic accuracy under noise-free conditions and maintains strong robustness under noisy interference, validating its efficiency, scalability, and practical value for rotating machinery fault diagnosis.
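The feature-construction step can be illustrated with a short sketch: log-Mel features summarise the acoustic channel, simple statistical descriptors summarise the vibration channel, and the two are concatenated into one complementary representation for downstream AutoML. The sampling rate, Mel settings, and choice of vibration statistics are assumptions, not the paper's exact configuration.

```python
import numpy as np
import librosa  # assumed available for Mel-spectrogram extraction

sr = 25600                       # assumed sampling rate (Hz)
acoustic = np.random.randn(sr)   # 1 s of microphone signal (stand-in)
vibration = np.random.randn(sr)  # 1 s of accelerometer signal (stand-in)

# Acoustic path: log-Mel spectrogram, averaged over time into a vector.
mel = librosa.feature.melspectrogram(y=acoustic, sr=sr, n_mels=64)
mel_vec = np.log1p(mel).mean(axis=1)  # shape (64,)

# Vibration path: common statistical fault indicators.
rms = np.sqrt(np.mean(vibration ** 2))
kurtosis = np.mean((vibration - vibration.mean()) ** 4) / vibration.std() ** 4
crest = np.max(np.abs(vibration)) / rms
stats = np.array([rms, kurtosis, crest])

# Fused complementary representation fed to the AutoML stage.
fused = np.concatenate([mel_vec, stats])
print(fused.shape)  # (67,)
```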