Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by the rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
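As a concrete illustration of the early- and late-fusion paradigms surveyed above, the following minimal PyTorch sketch contrasts the two; the embedding dimensions and seven-class emotion head are illustrative assumptions, not taken from any surveyed system.

```python
# Minimal sketch contrasting early fusion (feature-level) and late fusion
# (decision-level). Dimensions and class count are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality embeddings before a shared classifier."""
    def __init__(self, dims=(512, 128, 768), n_classes=7):
        super().__init__()
        self.head = nn.Linear(sum(dims), n_classes)

    def forward(self, visual, audio, text):
        return self.head(torch.cat([visual, audio, text], dim=-1))

class LateFusion(nn.Module):
    """Per-modality classifiers whose logits are averaged at decision level."""
    def __init__(self, dims=(512, 128, 768), n_classes=7):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d, n_classes) for d in dims)

    def forward(self, visual, audio, text):
        logits = [h(x) for h, x in zip(self.heads, (visual, audio, text))]
        return torch.stack(logits).mean(dim=0)

v, a, t = torch.randn(4, 512), torch.randn(4, 128), torch.randn(4, 768)
print(EarlyFusion()(v, a, t).shape, LateFusion()(v, a, t).shape)  # both (4, 7)
```

Hybrid schemes sit between these two extremes, exchanging information at intermediate layers rather than only at the input or the output.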
Spectrum sensing is an indispensable core part of cognitive radio dynamic spectrum access (DSA) and a key approach to alleviating spectrum scarcity in the Internet of Things (IoT). The key issue in practical IoT networks is robust sensing under the coexistence of low signal-to-noise ratios (SNRs) and non-Gaussian impulsive noise, where observations may be distorted differently across feature modalities, making conventional fusion unstable and degrading detection reliability. To address this challenge, the generalized Gaussian distribution (GGD) is adopted as the noise model, and a multimodal fusion framework termed BCAM-Net (bidirectional cross-attention multimodal network) is proposed. BCAM-Net adopts a parallel dual-branch architecture: a time-frequency branch that leverages the continuous wavelet transform (CWT) to extract time-frequency representations, and a temporal branch that learns long-range dependencies from raw signals. BCAM-Net utilizes a bidirectional cross-attention mechanism to achieve deep alignment and mutual calibration of temporal and time-frequency features, generating a fused representation that is highly robust to complex noise. Simulation results show that, under GGD noise with shape parameter β = 0.5, BCAM-Net achieves high detection probabilities in the low-SNR regime and outperforms representative baselines. At a false-alarm probability Pf = 0.1 and an SNR of −14 dB, it attains a detection probability of 0.9020, exceeding the CNN-Transformer, WT-ResNet, TFCFN, and conventional CNN benchmarks by 5.75%, 6.98%, 33.3%, and 21.1%, respectively. These results indicate that BCAM-Net can effectively improve spectrum sensing performance in low-SNR impulsive-noise scenarios and provides a lightweight, high-performance solution for practical cognitive radio spectrum sensing.
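The bidirectional cross-attention at the core of BCAM-Net can be sketched as two attention passes in which each branch queries the other. The following is a minimal PyTorch sketch under assumed dimensions, head count, and pooling; the published architecture may differ in detail.

```python
# Minimal sketch of bidirectional cross-attention fusion: each branch queries
# the other, and the mutually calibrated features are concatenated and pooled.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.tf_to_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_to_tf = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tf_feat, t_feat):
        # Time-frequency tokens attend to temporal tokens, and vice versa.
        tf_cal, _ = self.tf_to_t(query=tf_feat, key=t_feat, value=t_feat)
        t_cal, _ = self.t_to_tf(query=t_feat, key=tf_feat, value=tf_feat)
        fused = torch.cat([tf_cal, t_cal], dim=-1)  # (B, L, 2*dim)
        return fused.mean(dim=1)                    # pooled for a detector head

tf_feat = torch.randn(8, 64, 128)  # tokens from the CWT time-frequency branch
t_feat = torch.randn(8, 64, 128)   # tokens from the raw-signal temporal branch
print(BidirectionalCrossAttention()(tf_feat, t_feat).shape)  # (8, 256)
```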
Hepatocellular carcinoma presents with three distinct immune phenotypes, namely immune-desert, immune-excluded, and immune-inflamed, which indicate different treatment responses and prognostic outcomes. Although multi-omics parameters accurately reflect immune status, their clinical application is still restricted by expensive and less accessible assays. A comprehensive evaluation framework based on "easy-to-obtain" multimodal clinical parameters is therefore urgently required, incorporating: clinical features to establish baseline patient profiles and disease staging; routine blood tests to assess systemic metabolic and functional status; immune cell subsets to quantify subcluster dynamics; imaging features to delineate tumor morphology, spatial configuration, and perilesional anatomical relationships; and immunohistochemical markers to provide qualitative and quantitative detection of tumor antigens at the cellular and molecular levels. This integrated phenomic approach aims to improve prognostic stratification and clinical decision-making in hepatocellular carcinoma management conveniently and practically.
Detecting fake news in multimodal and multilingual social media environments is challenging due to inherent noise, inter-modal imbalance, computational bottlenecks, and semantic ambiguity. To address these issues, we propose SparseMoE-MFN, a novel unified framework that integrates sparse attention with a sparse-activated Mixture-of-Experts (MoE) architecture. This framework aims to enhance the efficiency, inferential depth, and interpretability of multimodal fake news detection. SparseMoE-MFN leverages LLaVA-v1.6-Mistral-7B-HF for efficient visual encoding and Qwen/Qwen2-7B for text processing. The sparse attention module adaptively filters irrelevant tokens and focuses on key regions, reducing computational costs and noise. The sparse MoE module dynamically routes inputs to specialized experts (visual, language, cross-modal alignment) based on content heterogeneity. This expert-specialization design boosts computational efficiency and semantic adaptability, enabling precise processing of complex content and improving performance on ambiguous categories. Evaluated on the large-scale, multilingual MR2 dataset, SparseMoE-MFN achieves state-of-the-art performance. It obtains an accuracy of 86.7% and a macro-averaged F1 score of 0.859, outperforming strong baselines like MiniGPT-4 by 3.4% and 3.2%, respectively. Notably, it shows significant advantages in the "unverified" category. Furthermore, SparseMoE-MFN demonstrates superior computational efficiency, with an average inference latency of 89.1 ms and 95.4 GFLOPs, substantially lower than existing models. Ablation studies and visualization analyses confirm the effectiveness of both the sparse attention and sparse MoE components in improving accuracy, generalization, and efficiency.
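The sparse-activated routing behind the MoE module can be illustrated with a minimal top-k gating sketch. The three experts below stand in for the visual, language, and cross-modal specialists; all layer sizes and gating details are assumptions for illustration, not the SparseMoE-MFN implementation.

```python
# Minimal sketch of sparse top-k expert routing: only the gated expert(s) run
# for each input, which is what makes MoE layers computationally sparse.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=256, n_experts=3, k=1):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                      # x: (B, dim)
        scores = self.gate(x)                  # (B, n_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dispatch each selected slot
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(4, 256)).shape)  # (4, 256)
```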
Business Process Modelling (BPM) is essential for analyzing, improving, and automating the flow of information within organizations, but traditional approaches based on manual interpretation are slow, error-prone, and require a high level of expertise. This article proposes an innovative alternative that overcomes these limitations by automatically generating comprehensive Business Process Model and Notation (BPMN) diagrams solely from verbal descriptions of the processes to be modeled, utilizing Large Language Models (LLMs) and multimodal Artificial Intelligence (AI). Experimental results, based on video recordings of process explanations provided by an expert from an organization (in this case, the Commercial Courts of a public justice administration), demonstrate that the proposed methodology successfully enables the automatic generation of complete and accurate BPMN diagrams, leading to significant improvements in the speed, accuracy, and accessibility of process modeling. This research makes a substantial contribution to the field of business process modeling: its methodology is groundbreaking in its use of LLMs and multimodal AI to handle different types of source material (text and video), combining several tools to minimize the number of queries and reduce the complexity of the prompts required for the automatic generation of successful BPMN diagrams.
Traditional artificial intelligence (AI)-based methods for breast cancer diagnosis often rely on a single modality, such as ultrasound images. With the rise of multimodal approaches, multiple data sources, including imaging from diverse medical modalities, structured clinical information, and unstructured medical reports, are increasingly integrated to provide richer and more informative signals for model training. This survey reviews the data modalities employed in AI-based breast cancer research, examines common multimodal combinations and fusion strategies, and discusses their applications across clinical tasks such as diagnosis, treatment planning, and outcome prediction. By consolidating the current literature and identifying critical gaps, this survey aims to guide future research toward the development of reliable, clinically relevant multimodal AI systems for use in breast cancer management.
Audio-visual speaker tracking aims to determine the locations of multiple speakers in a scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross-modal feature discrepancy, weak sound source localisation ambiguity, and frequent identity-switch errors remain unresolved; these severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. To this end, this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio-visual contrastive learning, which facilitates cross-modal feature alignment, mitigates cross-modal feature bias, and enhances identity-consistent representations. In the audio-visual measurement stage, we design a vision-guided weighted enhancement method for weak sound sources, which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy is introduced by combining 2D and 3D spatial geometric information, reducing frequent identity-switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods, demonstrating superior robustness in multispeaker scenarios.
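The audio-visual contrastive objective can be sketched as a symmetric InfoNCE-style loss over matched audio and visual speaker embeddings, pulling matched pairs together in the unified space. The embedding size and temperature below are illustrative assumptions.

```python
# Minimal sketch of a symmetric audio-visual contrastive (InfoNCE-style) loss:
# matched audio/visual pairs lie on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```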
In Human–Robot Interaction (HRI), generating robot trajectories that accurately reflect user intentions while ensuring physical realism remains challenging, especially in unstructured environments. In this study, we develop a multimodal framework that integrates symbolic task reasoning with continuous trajectory generation. The approach employs transformer models and adversarial training to map high-level intent to robotic motion. Information from multiple data sources, such as voice traits, hand and body keypoints, visual observations, and recorded paths, is integrated simultaneously. These signals are mapped into a shared representation that supports interpretable reasoning while enabling smooth and realistic motion generation. Based on this design, two different learning strategies are investigated. In the first, grammar-constrained Linear Temporal Logic (LTL) expressions are created from multimodal human inputs; these expressions are subsequently decoded into robot trajectories. The second method generates trajectories directly from symbolic intent and linguistic data, bypassing an intermediate logical representation. Transformer encoders combine multiple types of information, and autoregressive transformer decoders generate motion sequences. Adding smoothness and speed limits during training increases the likelihood of physical feasibility. To improve the realism and stability of the generated trajectories, an adversarial discriminator is also included during training to guide them toward the distribution of actual robot motion. Tests on the NATSGLD dataset indicate that the complete system exhibits stable training behaviour and performance. In normalised coordinates, the logic-based pipeline has an Average Displacement Error (ADE) of 0.040 and a Final Displacement Error (FDE) of 0.036. The adversarial generator improves substantially on this, reducing ADE to 0.021 and FDE to 0.018. Visual examination confirms that the generated trajectories closely align with observed motion patterns while preserving smooth temporal dynamics.
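ADE and FDE have standard definitions (mean and final-step Euclidean displacement between predicted and ground-truth trajectories), which the short sketch below computes directly; the synthetic trajectories are placeholders.

```python
# ADE = mean per-timestep displacement; FDE = displacement at the final step.
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: arrays of shape (T, D) for one trajectory."""
    errors = np.linalg.norm(pred - gt, axis=-1)  # per-timestep Euclidean error
    return errors.mean(), errors[-1]

gt = np.cumsum(np.random.randn(50, 2) * 0.01, axis=0)  # synthetic ground truth
pred = gt + np.random.randn(50, 2) * 0.02              # noisy prediction
ade, fde = ade_fde(pred, gt)
print(f"ADE={ade:.3f}  FDE={fde:.3f}")
```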
Gastrointestinal tumors require personalized treatment strategies due to their heterogeneity and complexity. Multimodal artificial intelligence (AI) addresses this challenge by integrating diverse data sources, including computed tomography (CT), magnetic resonance imaging (MRI), endoscopic imaging, and genomic profiles, to enable intelligent decision-making for individualized therapy. This approach leverages AI algorithms to fuse imaging, endoscopic, and omics data, facilitating comprehensive characterization of tumor biology, prediction of treatment response, and optimization of therapeutic strategies. By combining CT and MRI for structural assessment, endoscopic data for real-time visual inspection, and genomic information for molecular profiling, multimodal AI enhances the accuracy of patient stratification and treatment personalization. The clinical implementation of this technology demonstrates potential for improving patient outcomes, advancing precision oncology, and supporting individualized care in gastrointestinal cancers. Ultimately, multimodal AI serves as a transformative tool in oncology, bridging data integration with clinical application to effectively tailor therapies.
The diagnostic efficacy of contemporary bioimaging technologies remains constrained by inherent limitations of conventional imaging agents, including suboptimal sensitivity, off-target biodistribution, and inherent cytotoxicity. These limitations have catalyzed the development of intelligent stimuli-responsive block-copolymer-based bioimaging agents, which are engineered to dynamically respond to endogenous biochemical cues (e.g., pH gradients, redox potential, enzyme activity, hypoxic environments) or exogenous physical triggers (e.g., photoirradiation, thermal gradients, ultrasound (US)/magnetic stimuli). Through spatiotemporally controlled structural transformations, stimuli-responsive block copolymers enable precise contrast targeting, activatable signal amplification, and theranostic integration, thereby substantially enhancing the signal-to-noise ratios of bioimaging and diagnostic specificity. Hence, this mini-review systematically examines molecular engineering principles for designing pH-, redox-, enzyme-, light-, thermo-, and US/magnetic-responsive polymers, with emphasis on the structure-property relationships governing imaging performance modulation. Furthermore, we critically analyze emerging strategies for optical imaging, US synergies, and magnetic resonance imaging (MRI). Multimodal bioimaging is also elaborated, as it can overcome the inherent trade-offs between resolution, penetration depth, and functional specificity in single-modal approaches. By elucidating mechanistic insights and translational challenges, this mini-review aims to establish a design framework for stimuli-responsive block-copolymer-based high-fidelity bioimaging agents and accelerate their clinical translation in precise diagnosis and therapy.
The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.0 excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. Moreover, a key challenge of existing FND methods is that they only consider unimodal data (e.g., images), while more detailed multimodal data (e.g., user behaviour, temporal dynamics) is neglected, even though the latter is crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, strengthened by a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updating and adjust the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both textual and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
Arrhythmias are a frequently occurring phenomenon in clinical practice, but accurately distinguishing subtle rhythm abnormalities remains an ongoing difficulty for the research community in ECG-based studies. From a review of existing studies, two main factors appear to contribute to this problem: the uneven distribution of arrhythmia classes and the limited expressiveness of the features learned by current models. To overcome these limitations, this study proposes a dual-path multimodal framework, termed DM-EHC (Dual-Path Multimodal ECG Heartbeat Classifier), for ECG-based heartbeat classification. The proposed framework links 1D ECG temporal features with 2D time-frequency features. With these dual paths, the model can draw on complementary dimensions of feature information. The MIT-BIH arrhythmia database was selected as the baseline dataset for the experiments. Experimental results show that the proposed method outperforms single modalities and performs better for certain specific types of arrhythmias. The model achieved mean precision, recall, and F1 score of 95.14%, 92.26%, and 93.65%, respectively. These results indicate that the framework is robust and has potential value in automated arrhythmia classification.
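The dual-path idea, a 1D convolutional branch over the raw heartbeat fused with a 2D branch over its time-frequency image, can be sketched as follows; layer sizes and the five-class output are illustrative assumptions rather than the exact DM-EHC configuration.

```python
# Minimal sketch of a dual-path heartbeat classifier: a 1D branch for the raw
# ECG segment, a 2D branch for its time-frequency image, fused before the head.
import torch
import torch.nn as nn

class DualPathECG(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.path1d = nn.Sequential(
            nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.path2d = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, n_classes)

    def forward(self, beat_1d, tf_2d):
        feats = torch.cat([self.path1d(beat_1d), self.path2d(tf_2d)], dim=-1)
        return self.head(feats)

beat = torch.randn(8, 1, 250)       # one heartbeat segment per sample
tf_img = torch.randn(8, 1, 64, 64)  # its time-frequency representation
print(DualPathECG()(beat, tf_img).shape)  # (8, 5)
```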
The rapid proliferation of multimodal misinformation on social media demands detection frameworks that are not only accurate but also robust to noise, adversarial manipulation, and semantic inconsistency between modalities. Existing multimodal fake news detection approaches often rely on deterministic fusion strategies, which limits their ability to model uncertainty and complex cross-modal dependencies. To address these challenges, we propose Q-ALIGNer, a quantum-inspired multimodal framework that integrates classical feature extraction with quantum-state encoding, learnable cross-modal entanglement, and robustness-aware training objectives. The proposed framework adopts quantum formalism as a representational abstraction, enabling probabilistic modeling of multimodal alignment while remaining fully executable on classical hardware. Q-ALIGNer is evaluated on four widely used benchmark datasets (FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU) covering diverse platforms, languages, and content characteristics. Experimental results demonstrate consistent performance improvements over strong text-only, vision-only, multimodal, and quantum-inspired baselines, including BERT, RoBERTa, XLNet, ResNet, EfficientNet, ViT, Multimodal-BERT, ViLBERT, and QEMF. Q-ALIGNer achieves accuracies of 91.2%, 92.9%, 91.7%, and 92.1% on FakeNewsNet, Fakeddit, Weibo, and MediaEval VMU, respectively, with F1-score gains of 3-4 percentage points over QEMF. Robustness evaluation shows a reduced adversarial accuracy gap of 2.6%, compared to 7%-9% for baseline models, while calibration analysis indicates improved reliability with an expected calibration error of 0.031. In addition, computational analysis shows that Q-ALIGNer reduces training time to 19.6 h, compared to 48.2 h for QEMF at a comparable parameter scale. These results indicate that quantum-inspired alignment and entanglement can enhance robustness, uncertainty awareness, and efficiency in multimodal fake news detection, positioning Q-ALIGNer as a principled and practical content-centric framework for misinformation analysis.
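The expected calibration error reported above has a standard binning estimator, sketched below; the bin count and synthetic predictions are illustrative placeholders.

```python
# Standard binned expected calibration error (ECE): the confidence range is
# split into bins, and each bin contributes its weight times the absolute gap
# between empirical accuracy and mean confidence.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight * |accuracy - confidence|
    return ece

conf = np.random.uniform(0.5, 1.0, 1000)
correct = (np.random.uniform(size=1000) < conf).astype(float)  # well calibrated
print(expected_calibration_error(conf, correct))  # small value expected
```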
Objective: Accurate detection of PIK3CA mutations is essential for guiding PI3K-targeted therapies in breast cancer, yet sequencing is not universally accessible, and single-modality prediction models have limited performance. This study developed a multimodal deep learning framework integrating whole-slide imaging (WSI) and structured clinical data to improve mutation prediction. Methods: A total of 1,047 patients from TCGA and 166 patients from three external centers were included. The histopathology model used a transformer-based pretrained encoder (H-optimus-0) and a clustering-constrained attention multiple-instance learning (CLAM-SB MIL) classifier to generate WSI-level representations. The clinical model incorporated engineered clinical variables and an extreme gradient boosting (XGBoost) model. A decision-level late fusion strategy (Multimodal PIK3CA Model, MPM) combined the probabilistic outputs of both branches. Performance was evaluated with the area under the curve (AUC) and secondary metrics. Interpretability was assessed via attention heatmaps and Shapley additive explanations (SHAP) analysis. Results: MPM outperformed the single-modality models. It achieved an AUC of 0.745 on TCGA and maintained stable performance across the external cohorts (0.695, 0.690, and 0.680). SHAP analysis identified molecular subtype as the most influential clinical feature, whereas attention maps highlighted mutation-associated morphological regions. Conclusions: The developed multimodal framework effectively integrates complementary morphological and clinical information and provides a robust and generalizable method for predicting PIK3CA mutation status. Strong multicenter adaptability and biological interpretability support its potential use as a clinical decision-support tool and an accessible alternative to molecular testing.
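Decision-level late fusion of the two branches reduces to combining their probability outputs before evaluation. A minimal sketch with an assumed equal weighting (the actual MPM combination rule may differ) is given below, using synthetic labels and scores.

```python
# Minimal sketch of decision-level late fusion: each branch emits a mutation
# probability, a weighted average forms the final score, evaluated by AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def late_fusion(p_wsi, p_clinical, w=0.5):
    return w * p_wsi + (1.0 - w) * p_clinical

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)  # synthetic PIK3CA mutation labels
p_wsi = np.clip(y * 0.3 + rng.uniform(0, 0.7, 200), 0, 1)  # WSI branch scores
p_cli = np.clip(y * 0.2 + rng.uniform(0, 0.8, 200), 0, 1)  # clinical branch
fused = late_fusion(p_wsi, p_cli)
print("AUC (fused):", roc_auc_score(y, fused))
```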
Multimodal dialogue systems often fail to maintain coherent reasoning over extended conversations and suffer from hallucination due to limited context modeling capabilities. Current approaches struggle with cross-modal alignment, temporal consistency, and robust handling of noisy or incomplete inputs across multiple modalities. We propose Multi-Agent Chain of Thought (CoT), a novel multi-agent chain-of-thought reasoning framework in which specialized agents for the text, vision, and speech modalities collaboratively construct shared reasoning traces through inter-agent message passing and consensus-voting mechanisms. Our architecture incorporates self-reflection modules, conflict resolution protocols, and dynamic rationale alignment to enhance consistency, factual accuracy, and user engagement. The framework employs a hierarchical attention mechanism with cross-modal fusion and implements adaptive reasoning depth based on dialogue complexity. Comprehensive evaluations on Situated Interactive Multi-Modal Conversations (SIMMC) 2.0, VisDial v1.0, and newly introduced challenging scenarios demonstrate statistically significant improvements in grounding accuracy (p < 0.01), chain-of-thought interpretability, and robustness to adversarial inputs compared to state-of-the-art monolithic transformer baselines and existing multi-agent approaches.
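The consensus-voting step can be sketched as confidence-weighted voting over agent proposals; the agent/answer/confidence structure below is an illustrative assumption, not the published protocol.

```python
# Minimal sketch of confidence-weighted consensus voting: the answer whose
# supporting agents carry the highest total confidence wins.
from collections import defaultdict

def consensus_vote(proposals):
    """proposals: list of (agent, answer, confidence) tuples."""
    totals = defaultdict(float)
    for _agent, answer, conf in proposals:
        totals[answer] += conf
    return max(totals, key=totals.get)

votes = [("text", "red jacket", 0.9),
         ("vision", "red jacket", 0.7),
         ("speech", "blue jacket", 0.8)]
print(consensus_vote(votes))  # "red jacket" (total 1.6 vs. 0.8)
```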
It remains difficult to automate the creation and validation of Unified Modeling Language (UML) diagrams due to unstructured requirements, limited automated pipelines, and the lack of reliable evaluation methods. This study introduces a cohesive architecture that combines requirement development, UML synthesis, and multimodal validation. First, LLaMA-3.2-1B-Instruct was utilized to generate user-focused requirements. Then, DeepSeek-R1-Distill-Qwen-32B applies its reasoning capabilities to transform these requirements into PlantUML code. Using this dual-LLM pipeline, we constructed a synthetic dataset of 11,997 UML diagrams spanning six major diagram families. Rendering analysis showed that 89.5% of the generated diagrams compile correctly, while invalid cases were detected automatically. To assess quality, we employed a multimodal scoring method that combines Qwen2.5-VL-3B, LLaMA-3.2-11B-Vision-Instruct, and Aya-Vision-8B, with weights based on MMMU performance. A study with 94 experts revealed strong alignment between automatic and manual evaluations, yielding a Pearson correlation of r = 0.82 and a Fleiss' Kappa of 0.78, indicating a high degree of concordance between automated metrics and human judgment. Overall, the results demonstrate that our scoring system is effective and that the proposed generation pipeline produces UML diagrams that are both syntactically correct and semantically coherent. More broadly, the system provides a scalable and reproducible foundation for future work in AI-driven software modeling and multimodal verification.
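The MMMU-weighted scoring can be sketched as a weighted average of per-judge scores, with weights proportional to each judge's benchmark performance; the MMMU values below are placeholders, not the figures used in the study.

```python
# Minimal sketch of benchmark-weighted score aggregation: each vision-language
# judge scores a rendered diagram, weighted by its (placeholder) MMMU score.
def weighted_quality_score(judge_scores, mmmu_perf):
    total = sum(mmmu_perf.values())
    return sum(judge_scores[j] * (p / total) for j, p in mmmu_perf.items())

scores = {"Qwen2.5-VL-3B": 4.0, "LLaMA-3.2-11B-Vision": 4.5, "Aya-Vision-8B": 3.5}
mmmu = {"Qwen2.5-VL-3B": 47.0, "LLaMA-3.2-11B-Vision": 50.7, "Aya-Vision-8B": 40.0}
print(round(weighted_quality_score(scores, mmmu), 3))
```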
For decades, the central dogma of oncology has been that a cancer's identity is inextricably linked to its anatomical origin. This principle underpins the entire diagnostic and therapeutic framework, from histology-based classification to site-specific treatment guidelines. Yet this framework catastrophically fails for a substantial population of patients diagnosed with cancer of unknown primary (CUP). These patients present with metastatic disease, yet their primary tumors remain elusive despite exhaustive clinical workup [1]. CUP, accounting for 1%-3% of all cancer diagnoses, is an enigma with devastating consequences: the median overall survival is only 2-12 months [2-4]. The inability to pinpoint an origin forces clinicians to rely on broad-spectrum empirical chemotherapy, such as taxane-carboplatin regimens, which have limited efficacy and exclude patients from the promise of targeted therapies and clinical trials [5]. CUP is not only a diagnostic challenge but also an indictment of the siloed approach to understanding malignancy: this cancer highlights the limitations of origin-based diagnostic frameworks. However, the confluence of high-dimensional biological data and advanced artificial intelligence (AI) is now poised to address this long-standing diagnostic limitation and to herald a new era not only for CUP but for oncology as a whole (Figure 1).
High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
To ensure the safe and stable operation of rotating machinery, intelligent fault diagnosis methods hold significant research value. However, existing diagnostic approaches largely rely on manual feature extraction and expert experience, which limits their adaptability under variable operating conditions and in strong-noise environments, severely affecting the generalization capability of diagnostic models. To address this issue, this study proposes a multimodal fusion fault diagnosis framework based on Mel spectrograms and automated machine learning (AutoML). The framework first extracts fault-sensitive Mel time-frequency features from acoustic signals and fuses them with statistical features of vibration signals to construct complementary fault representations. On this basis, automated machine learning techniques are introduced to enable end-to-end construction of the diagnostic workflow and acquisition of the optimal model configuration. Finally, diagnostic decisions are made by automatically integrating the predictions of multiple high-performance base models. Experimental results on a centrifugal pump vibration and acoustic dataset demonstrate that the proposed framework achieves high diagnostic accuracy under noise-free conditions and maintains strong robustness under noisy interference, validating its efficiency, scalability, and practical value for rotating machinery fault diagnosis.
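The feature-construction step, Mel time-frequency features from the acoustic channel fused with simple statistics of the vibration channel, can be sketched with librosa; the sampling rate, Mel settings, and choice of statistics are illustrative assumptions.

```python
# Minimal sketch of multimodal feature construction: pooled log-Mel features
# from the acoustic signal concatenated with vibration-signal statistics.
import numpy as np
import librosa

def build_features(acoustic, vibration, sr=16000, n_mels=64):
    mel = librosa.feature.melspectrogram(y=acoustic, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel)        # log-Mel time-frequency map
    mel_vec = mel_db.mean(axis=1)            # pooled over time: (n_mels,)
    stats = np.array([vibration.mean(), vibration.std(),
                      vibration.max(), np.abs(vibration).mean()])
    return np.concatenate([mel_vec, stats])  # complementary representation

acoustic = np.random.randn(16000).astype(np.float32)   # 1 s of audio
vibration = np.random.randn(16000).astype(np.float32)  # matching vibration
print(build_features(acoustic, vibration).shape)  # (68,)
```

A vector like this is what an AutoML system can then consume to search over candidate models and assemble the ensemble of base learners described above.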
Deep learning-based methods have shown great potential in intelligent bearing fault diagnosis. However, most existing approaches suffer from the scarcity of labeled data, which often results in insufficient robustness under complex working conditions and a general lack of interpretability. To address these challenges, we propose a physics-informed multimodal fault diagnosis framework based on few-shot learning, which integrates a 2D time-frequency image encoder and a 1D vibration signal encoder. Specifically, we embed prior knowledge of multi-resolution analysis from signal processing into the model by designing a Laplace Wavelet Convolution (LWC) module, which enhances interpretability since wavelet coefficients naturally correspond to specific frequency and temporal structures. To further balance the guidance of physical priors with the flexibility of learnable representations, we introduce a parametric multi-kernel wavelet that employs channel-wise dynamic attention to adaptively select relevant wavelet bases, thereby improving feature expressiveness. Moreover, we develop a Mahalanobis-Prototype Joint Metric, which constructs more accurate and distribution-consistent decision boundaries under few-shot conditions. Comprehensive experiments on the Case Western Reserve University (CWRU) and Paderborn University (PU) bearing datasets demonstrate the superior effectiveness, robustness, and interpretability of the proposed approach compared with state-of-the-art baselines.
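A Mahalanobis-prototype metric for few-shot classification can be sketched as nearest-prototype assignment under a shared, regularized covariance estimated from the support set; the shared-covariance choice and regularization below are illustrative assumptions (the paper's joint metric may combine terms differently).

```python
# Minimal sketch: class prototypes from support embeddings, queries assigned by
# Mahalanobis distance under a shared, regularized covariance estimate.
import numpy as np

def mahalanobis_prototype_classify(support, support_y, query, eps=1e-3):
    classes = np.unique(support_y)
    protos = np.stack([support[support_y == c].mean(axis=0) for c in classes])
    cov = np.cov(support, rowvar=False) + eps * np.eye(support.shape[1])
    inv_cov = np.linalg.inv(cov)
    diffs = query[:, None, :] - protos[None, :, :]         # (Q, C, D)
    d2 = np.einsum("qcd,de,qce->qc", diffs, inv_cov, diffs)  # squared distances
    return classes[d2.argmin(axis=1)]

rng = np.random.default_rng(1)
support = rng.normal(size=(10, 8)) + np.repeat(np.eye(2, 8) * 3, 5, axis=0)
support_y = np.repeat([0, 1], 5)            # 5-shot, 2-way toy episode
query = rng.normal(size=(4, 8))
print(mahalanobis_prototype_classify(support, support_y, query))
```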
基金supported by the Institute of Information&Communications Technology Planning&Evaluation grant funded by the Korea government(MSIT)(No.RS-2021-II211341,AI Graduate School Support Program,Chung-Ang University)in part by the Institute of Information and Communications Technology Planning and Evaluation grant funded by the Korea government(MSIT)(Development of Integrated Development Framework that Supports Automatic Neural Network Generation and Deployment Optimized for Runtime Environment,Grant No.2021-0-00766).
文摘Multimodal emotion recognition has emerged as a key research area for enabling human-centered artificial intelligence,supported by the rapid progress in vision,audio,language,and physiological modeling.Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms,yet the field remains fragmented due to differences in feature alignment,temporal synchronization,modality reliability,and robustness to noise or missing inputs.This survey provides a comprehensive analysis of MER research from 2021 to 2025,consolidating advances in modality-specific representation learning,cross-modal feature construction,and early,late,and hybrid fusion paradigms.We systematically review visual,acoustic,textual,and sensor-based embeddings,highlighting howpre-trained encoders,self-supervised learning,and large languagemodels have reshaped the representational foundations ofMER.We further categorize fusion strategies by interaction depth and architectural design,examining how attention mechanisms,cross-modal transformers,adaptive gating,and multimodal large language models redefine the integration of affective signals.Finally,we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability,generalization,and interpretability.This survey aims to provide a unified perspective onmultimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
基金supported in part by JSPS Grants-in-Aid for Scientific Research 25K07742 and 25K23457.
文摘Spectrum sensing is an indispensable core part of cognitive radio dynamic spectrum access(DSA)and a key approach to alleviating spectrum scarcity in the Internet of Things(IoT).The key issue in practical IoT networks is robust sensing under the coexistence of low signal-to-noise ratios(SNRs)and non-Gaussian impulsive noise,where observations may be distorted differently across feature modalities,making conventional fusion unstable and degrading detection reliability.To address this challenge,the generalized Gaussian distribution(GGD)is adopted as the noise model,and a multimodal fusion framework termed BCAM-Net(bidirectional cross-attention multimodal network)is proposed.BCAM-Net adopts a parallel dual-branch architecture:a time-frequency branch that leverages the continuous wavelet transform(CWT)to extract time-frequency representations,and a temporal branch that learns long-range dependencies from raw signals.BCAM-Net utilizes a bidirectional cross-attention mechanism to achieve deep alignment and mutual calibration of temporal and time-frequency features,generating a fused representation that is highly robust to complex noise.Simulation results show that,under GGD noise with shape parameterβ=0.5,BCAM-Net achieves high detection probabilities in the low-SNR regime and outperforms representative baselines.At a false alarm probability Pf=0.1 and SNR of−14 dB,it attains a detection probability of 0.9020,exceeding the CNN-Transformer,WT-ResNet,TFCFN,and conventional CNN benchmarks by 5.75%,6.98%,33.3%,and 21.1%,respectively.These results indicate that BCAM-Net can effectively improve spectrum sensing performance in low-SNR impulsive-noise scenarios,and provides a lightweight,high-performance solution for practical cognitive radio spectrum sensing.
文摘Hepatocellular carcinoma presents with three distinct immune phenotypes,including immune-desert,immune-excluded,and immune-inflamed,indicating various treatment responses and prognostic outcomes.The clinical application of multi-omics parameters is still restricted by the expensive and less accessible assays,although they accurately reflect immune status.A comprehensive evaluation framework based on“easy-to-obtain”multi-model clinical parameters is urgently required,incorporating clinical features to establish baseline patient profiles and disease staging;routine blood tests assessing systemic metabolic and functional status;immune cell subsets quantifying subcluster dynamics;imaging features delineating tumor morphology,spatial configuration,and perilesional anatomical relationships;immunohistochemical markers positioning qualitative and quantitative detection of tumor antigens from the cellular and molecular level.This integrated phenomic approach aims to improve prognostic stratification and clinical decision-making in hepatocellular carcinoma management conveniently and practically.
基金supported by the National Social Science Fund of China(20BXW101).
文摘Detecting fake news in multimodal and multilingual social media environments is challenging due to inherent noise,inter-modal imbalance,computational bottlenecks,and semantic ambiguity.To address these issues,we propose SparseMoE-MFN,a novel unified framework that integrates sparse attention with a sparse-activated Mixture of-Experts(MoE)architecture.This framework aims to enhance the efficiency,inferential depth,and interpretability of multimodal fake news detection.Sparse MoE-MFN leverages LLaVA-v1.6-Mistral-7B-HF for efficient visual encoding and Qwen/Qwen2-7B for text processing.The sparse attention module adaptively filters irrelevant tokens and focuses on key regions,reducing computational costs and noise.The sparse MoE module dynamically routes inputs to specialized experts(visual,language,cross-modal alignment)based on content heterogeneity.This expert specialization design boosts computational efficiency and semantic adaptability,enabling precise processing of complex content and improving performance on ambiguous categories.Evaluated on the large-scale,multilingualMR2 dataset,SparseMoEMFN achieves state-of-the-art performance.It obtains an accuracy of 86.7%and a macro-averaged F1 score of 0.859,outperforming strong baselines like MiniGPT-4 by 3.4%and 3.2%,respectively.Notably,it shows significant advantages in the“unverified”category.Furthermore,SparseMoE-MFN demonstrates superior computational efficiency,with an average inference latency of 89.1 ms and 95.4 GFLOPs,substantially lower than existing models.Ablation studies and visualization analyses confirm the effectiveness of both sparse attention and sparse MoE components in improving accuracy,generalization,and efficiency.
基金funded by Fundación CajaCanarias and Fundación Bancaria“la Caixa”,grant number 2023DIG11.
文摘Business Process Modelling(BPM)is essential for analyzing,improving,and automating the flow of information within organizations,but traditional approaches based on manual interpretation are slow,error-prone,and require a high level of expertise.This article proposes an innovative alternative solution that overcomes these limitations by automatically generating comprehensive Business Process Modelling and Notation(BPMN)diagrams solely from verbal descriptions of the processes to be modeled,utilizing Large Language Models(LLMs)and multimodal Artificial Intelligence(AI).Experimental results,based on video recordings of process explanations provided by an expert from an organization(in this case,the Commercial Courts of a public justice administration),demonstrate that the proposed methodology successfully enables the automatic generation of complete and accurate BPMN diagrams,leading to significant improvements in the speed,accuracy,and accessibility of process modeling.This research makes a substantial contribution to the field of business process modeling,as its methodology is groundbreaking in its use of LLMs and multimodal AI capabilities to handle different types of source material(text and video),combining several tools to minimize the number of queries and reduce the complexity of the prompts required for the automatic generation of successful BPMN diagrams.
文摘Traditional artificial intelligence(AI)-based methods for breast cancer diagnosis often rely on a single modality,such as ultrasound images.With the rise of multimodal approaches,multiple data sources,including imaging from diverse medical modalities,structured clinical information,and unstructured medical reports,are increasingly integrated to provide richer and more informative signals for model training.This survey reviews the data modalities employed in AI-based breast cancer research,examines common multimodal combinations and fusion strategies,and discusses their applications across clinical tasks such as diagnosis,treatment planning,and outcome prediction.By consolidating current literature and identifying critical gaps,this survey aims to guide future research toward the development of reliable,clinically relevant multimodal AI systems for use in breast cancer management.
基金supported by the National Natural Science Foundation of China(62403345)the Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology(2024B1212010006)the Shanxi Provincial Department of Science and Technology Basic Research Project(202403021212174,202403021221074).
文摘Audio-visual speaker tracking aims to determine the locations of multiple speakers in the scene by leveraging signals captured from multisensor platforms.Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking.However,in complex multispeaker tracking scenarios,critical challenges such as cross-modal feature discrepancy,weak sound source localisation ambiguity and frequent identity switch errors remain unresolved,which severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories.To this end,this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning(AVCLNet).By integrating heterogeneous modal representations into a unified space through audio-visual contrastive learning,which facilitates cross-modal feature alignment,mitigates cross-modal feature bias and enhances identity-consistent representations.In the audio-visual measurement stage,we design a vision-guided weak sound source weighted enhancement method,which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighted mechanism to improve the detectability of weak sound sources.Furthermore,in the data association phase,a dual geometric constraint strategy is introduced by combining the 2D and 3D spatial geometric information,reducing frequent identity switch errors.Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods,demonstrating superior robustness in multispeaker scenarios.
基金The authors extend their appreciation to Prince Sattam bin Abdulaziz University for funding this research work through the project number(PSAU/2024/01/32082).
文摘In Human–Robot Interaction(HRI),generating robot trajectories that accurately reflect user intentions while ensuring physical realism remains challenging,especially in unstructured environments.In this study,we develop a multimodal framework that integrates symbolic task reasoning with continuous trajectory generation.The approach employs transformer models and adversarial training to map high-level intent to robotic motion.Information from multiple data sources,such as voice traits,hand and body keypoints,visual observations,and recorded paths,is integrated simultaneously.These signals are mapped into a shared representation that supports interpretable reasoning while enabling smooth and realistic motion generation.Based on this design,two different learning strategies are investigated.In the first step,grammar-constrained Linear Temporal Logic(LTL)expressions are created from multimodal human inputs.These expressions are subsequently decoded into robot trajectories.The second method generates trajectories directly from symbolic intent and linguistic data,bypassing an intermediate logical representation.Transformer encoders combine multiple types of information,and autoregressive transformer decoders generate motion sequences.Adding smoothness and speed limits during training increases the likelihood of physical feasibility.To improve the realism and stability of the generated trajectories during training,an adversarial discriminator is also included to guide them toward the distribution of actual robot motion.Tests on the NATSGLD dataset indicate that the complete system exhibits stable training behaviour and performance.In normalised coordinates,the logic-based pipeline has an Average Displacement Error(ADE)of 0.040 and a Final Displacement Error(FDE)of 0.036.The adversarial generator makes substantially more progress,reducing ADE to 0.021 and FDE to 0.018.Visual examination confirms that the generated trajectories closely align with observed motion patterns while preserving smooth temporal dynamics.
基金Supported by Xuhui District Health Commission,No.SHXH202214.
文摘Gastrointestinal tumors require personalized treatment strategies due to their heterogeneity and complexity.Multimodal artificial intelligence(AI)addresses this challenge by integrating diverse data sources-including computed tomography(CT),magnetic resonance imaging(MRI),endoscopic imaging,and genomic profiles-to enable intelligent decision-making for individualized therapy.This approach leverages AI algorithms to fuse imaging,endoscopic,and omics data,facilitating comprehensive characterization of tumor biology,prediction of treatment response,and optimization of therapeutic strategies.By combining CT and MRI for structural assessment,endoscopic data for real-time visual inspection,and genomic information for molecular profiling,multimodal AI enhances the accuracy of patient stratification and treatment personalization.The clinical implementation of this technology demonstrates potential for improving patient outcomes,advancing precision oncology,and supporting individualized care in gastrointestinal cancers.Ultimately,multimodal AI serves as a transformative tool in oncology,bridging data integration with clinical application to effectively tailor therapies.
基金supported by the National Natural Science Foundation of China (Nos.22208218,22078196,and 22278268)the Natural Science Foundation of Shanghai (No.22ZR1460400)Collaborative Innovation Center of Fragrance Flavour and Cosmetics,and Collaborative Innovation Project of Shanghai Institute of Technology (No.XTCX2023-07)。
文摘The diagnostic efficacy of contemporary bioimaging technologies remains constrained by inherent limitations of conventional imaging agents,including suboptimal sensitivity,off-target biodistribution,and inherent cytotoxicity.These limitations have catalyzed the development of intelligent stimuli-responsive block copolymers-based bioimaging agents,which was engineered to dynamically respond to endogenous biochemical cues(e.g.,p H gradients,redox potential,enzyme activity,hypoxia environment) or exogenous physical triggers(e.g.,photoirradiation,thermal gradients,ultrasound(US)/magnetic stimuli).Through spatiotemporally controlled structural transformations,stimuli-responsive block copolymers enable precise contrast targeting,activatable signal amplification,and theranostic integration,thereby substantially enhancing signal-to-noise ratios of bioimaging and diagnostic specificity.Hence,this mini-review systematically examines molecular engineering principles for designing p H-,redox-,enzyme-,light-,thermo-,and US/magnetic-responsive polymers,with emphasis on structure-property relationships governing imaging performance modulation.Furthermore,we critically analyze emerging strategies for optical imaging,US synergies,and magnetic resonance imaging(MRI).Multimodal bioimaging has also been elaborated,which could overcome the inherent trade-offs between resolution,penetration depth,and functional specificity in single-modal approaches.By elucidating mechanistic insights and translational challenges,this mini-review aims to establish a design framework of stimuli-responsive block copolymersbased for high fidelity bioimaging agents and accelerate their clinical translation in precise diagnosis and therapy.
文摘The problem of fake news detection(FND)is becoming increasingly important in the field of natural language processing(NLP)because of the rapid dissemination of misleading information on the web.Large language models(LLMs)such as GPT-4.Zero excels in natural language understanding tasks but can still struggle to distinguish between fact and fiction,particularly when applied in the wild.However,a key challenge of existing FND methods is that they only consider unimodal data(e.g.,images),while more detailed multimodal data(e.g.,user behaviour,temporal dynamics)is neglected,and the latter is crucial for full-context understanding.To overcome these limitations,we introduce M3-FND(Multimodal Misinformation Mitigation for False News Detection),a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments.Our method proposes a hybrid system that combines image-text alignment,user credibility profiling,and temporal pattern recognition,which is also strengthened through a natural feedback loop that provides real-time feedback for correcting downstream errors.We use contextual reinforcement learning to schedule prompt updating and update the classifier threshold based on the latest multimodal input,which enables the model to better adapt to changing misinformation attack strategies.M3-FND is tested on three diverse datasets,FakeNewsNet,Twitter15,andWeibo,which contain both text and visual socialmedia content.Experiments showthatM3-FND significantly outperforms conventional and LLMbased baselines in terms of accuracy,F1-score,and AUC on all benchmarks.Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
基金supported by the Innovative Human Resource Development for Local Intel-lectualization program through the Institute of Information&Communications Technology Planning&Evaluation(IITP)grant funded by the Korea government(MSIT)(No.IITP-2026-2020-0-01741)the research fund of Hanyang University(HY-2025-1110).
文摘Arrhythmias are a frequently occurring phenomenon in clinical practice,but how to accurately dis-tinguish subtle rhythm abnormalities remains an ongoing difficulty faced by the entire research community when conducting ECG-based studies.From a review of existing studies,two main factors appear to contribute to this problem:the uneven distribution of arrhythmia classes and the limited expressiveness of features learned by current models.To overcome these limitations,this study proposes a dual-path multimodal framework,termed DM-EHC(Dual-Path Multimodal ECG Heartbeat Classifier),for ECG-based heartbeat classification.The proposed framework links 1D ECG temporal features with 2D time–frequency features.By setting up the dual paths described above,the model can process more dimensions of feature information.The MIT-BIH arrhythmia database was selected as the baseline dataset for the experiments.Experimental results show that the proposed method outperforms single modalities and performs better for certain specific types of arrhythmias.The model achieved mean precision,recall,and F1 score of 95.14%,92.26%,and 93.65%,respectively.These results indicate that the framework is robust and has potential value in automated arrhythmia classification.
基金Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2026R77)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia,the Deanship of Scientific Research at Northern Border University,Arar,Saudi Arabia,through the project number NBU-FFR-2026-2248-02.
文摘The rapid proliferation of multimodal misinformation on social media demands detection frameworks that are not only accurate but also robust to noise,adversarial manipulation,and semantic inconsistency between modalities.Existing multimodal fake news detection approaches often rely on deterministic fusion strategies,which limits their ability to model uncertainty and complex cross-modal dependencies.To address these challenges,we propose Q-ALIGNer,a quantum-inspired multimodal framework that integrates classical feature extraction with quantumstate encoding,learnable cross-modal entanglement,and robustness-aware training objectives.The proposed framework adopts quantumformalism as a representational abstraction,enabling probabilisticmodeling ofmultimodal alignment while remaining fully executable on classical hardware.Q-ALIGNer is evaluated on four widely used benchmark datasets—FakeNewsNet,Fakeddit,Weibo,and MediaEval VMU—covering diverse platforms,languages,and content characteristics.Experimental results demonstrate consistent performance improvements over strong text-only,vision-only,multimodal,and quantum-inspired baselines,including BERT,RoBERTa,XLNet,ResNet,EfficientNet,ViT,Multimodal-BERT,ViLBERT,and QEMF.Q-ALIGNer achieves accuracies of 91.2%,92.9%,91.7%,and 92.1%on FakeNewsNet,Fakeddit,Weibo,and MediaEval VMU,respectively,with F1-score gains of 3–4 percentage points over QEMF.Robustness evaluation shows a reduced adversarial accuracy gap of 2.6%,compared to 7%–9%for baseline models,while calibration analysis indicates improved reliability with an expected calibration error of 0.031.In addition,computational analysis shows that Q-ALIGNer reduces training time to 19.6 h compared to 48.2 h for QEMF at a comparable parameter scale.These results indicate that quantum-inspired alignment and entanglement can enhance robustness,uncertainty awareness,and efficiency in multimodal fake news detection,positioning Q-ALIGNer as a principled and practical content-centric framework for misinformation analysis.
基金financially supported by the Hebei Natural Science Foundation(Grant No.H2024206504)the Medical Science Research Project of Hebei(Grant No.20260484,20260530)the Fundamental Research Funds for the Central Universities(Grant No.20822041J4123).
文摘Objective:Accurate detection of PIK3CA mutations is essential for guiding PI3K-targeted therapies in breast cancer,yet sequencing is not universally accessible,and single-modality prediction models have limited performance.This study developed a multimodal deep learning framework integrating whole-slide imaging(WSI)and structured clinical data to improve mutation prediction.Methods:A total of 1,047 patients from TCGA and 166 patients from 3 external centers were included.The histopathology model used a transformer-based pretrained encoder(H-optimus-0)and a clustering-constrained attention multiple instance learning(CLAM-SB MIL)classifier to generate WSI-level representations.The clinical model incorporated engineered clinical variables and an extreme gradient boosting(XGBoost)model.A decision-level late fusion strategy(Multimodal PIK3CA Model,MPM)combined probabilistic outputs from both branches.Performance was evaluated with the area under the curve(AUC)and secondary metrics.Interpretability was assessed via attention heatmaps and shapley additive explanations(SHAP)analysis.Results:MPM outperformed single-modality models.It achieved an AUC of 0.745 on TCGA and maintained stable performance across external cohorts(0.695,0.690,and 0.680).SHAP analysis identified molecular subtype as the most influential clinical feature,whereas attention maps highlighted mutation-associated morphological regions.Conclusions:The developed multimodal framework effectively integrates complementary morphological and clinical information,and provides a robust and generalizable method for predicting PIK3CA mutation status.Strong multicenter adaptability and biological interpretability support its potential use as a clinical decision-support tool and an accessible alternative to molecular testing.
Abstract: Multimodal dialogue systems often fail to maintain coherent reasoning over extended conversations and suffer from hallucination due to limited context modeling capabilities. Current approaches struggle with cross-modal alignment, temporal consistency, and robust handling of noisy or incomplete inputs across multiple modalities. We propose Multi Agent-Chain of Thought (CoT), a novel multi-agent chain-of-thought reasoning framework in which specialized agents for the text, vision, and speech modalities collaboratively construct shared reasoning traces through inter-agent message passing and consensus voting mechanisms. Our architecture incorporates self-reflection modules, conflict resolution protocols, and dynamic rationale alignment to enhance consistency, factual accuracy, and user engagement. The framework employs a hierarchical attention mechanism with cross-modal fusion and implements adaptive reasoning depth based on dialogue complexity. Comprehensive evaluations on Situated Interactive Multi-Modal Conversations (SIMMC) 2.0, VisDial v1.0, and newly introduced challenging scenarios demonstrate statistically significant improvements in grounding accuracy (p<0.01), chain-of-thought interpretability, and robustness to adversarial inputs compared to state-of-the-art monolithic transformer baselines and existing multi-agent approaches.
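A toy sketch of the consensus-voting step may help fix intuition: each modality agent proposes a candidate answer with a confidence, and the framework keeps the answer with the highest total confidence. The agent names, confidence values, and tie-breaking behavior here are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of confidence-weighted consensus voting across agents.
from collections import defaultdict

def consensus_vote(agent_outputs):
    """agent_outputs: list of (agent_name, answer, confidence) tuples."""
    scores = defaultdict(float)
    for _, answer, conf in agent_outputs:
        scores[answer] += conf  # accumulate confidence per candidate answer
    return max(scores, key=scores.get)

votes = [
    ("text_agent",   "blue jacket", 0.9),
    ("vision_agent", "blue jacket", 0.8),
    ("speech_agent", "navy coat",   0.6),
]
print(consensus_vote(votes))  # -> "blue jacket"
```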
Funding: Supported by the DH2025-TN07-07 project conducted at the Thai Nguyen University of Information and Communication Technology, Thai Nguyen, Vietnam, with additional support from the AI in Software Engineering Lab.
Abstract: It remains difficult to automate the creation and validation of Unified Modeling Language (UML) diagrams due to unstructured requirements, limited automated pipelines, and the lack of reliable evaluation methods. This study introduces a cohesive architecture that combines requirement development, UML synthesis, and multimodal validation. First, LLaMA-3.2-1B-Instruct was utilized to generate user-focused requirements. Then, DeepSeek-R1-Distill-Qwen-32B applied its reasoning capabilities to transform these requirements into PlantUML code. Using this dual-LLM pipeline, we constructed a synthetic dataset of 11,997 UML diagrams spanning six major diagram families. Rendering analysis showed that 89.5% of the generated diagrams compile correctly, while invalid cases were detected automatically. To assess quality, we employed a multimodal scoring method that combines Qwen2.5-VL-3B, LLaMA-3.2-11B-Vision-Instruct, and Aya-Vision-8B, with weights based on MMMU performance. A study with 94 experts revealed strong alignment between automatic and manual evaluations, yielding a Pearson correlation of r=0.82 and a Fleiss' Kappa of 0.78, indicating a high degree of concordance between automated metrics and human judgment. Overall, the results demonstrate that our scoring system is effective and that the proposed generation pipeline produces UML diagrams that are both syntactically correct and semantically coherent. More broadly, the system provides a scalable and reproducible foundation for future work in AI-driven software modeling and multimodal verification.
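The MMMU-weighted scoring idea reduces to a weighted average: each vision-language judge's quality score is weighted by its normalized MMMU benchmark score. The sketch below assumes placeholder MMMU accuracies and a 0-10 quality scale; these numbers are illustrative, not the models' actual benchmark results.

```python
# Minimal sketch of benchmark-weighted multimodal scoring.
import numpy as np

judges = ["Qwen2.5-VL-3B", "LLaMA-3.2-11B-Vision-Instruct", "Aya-Vision-8B"]
mmmu = np.array([0.47, 0.50, 0.42])         # hypothetical MMMU accuracies
weights = mmmu / mmmu.sum()                  # normalize so weights sum to 1

diagram_scores = np.array([8.0, 7.5, 8.5])   # each judge's 0-10 quality score
final_score = float(weights @ diagram_scores)
print(f"weighted UML quality score: {final_score:.2f}")
```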
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 32270688, 31801117, and 82430107 to X.L., and 32500589 to H.S.), the China Postdoctoral Science Foundation (Grant Nos. BX20240253 and 2024M762384 to H.S.), the Natural Science Foundation of Tianjin (Grant No. 24JCQNJC01280 to H.S.), and the Tianjin Key Medical Discipline (Specialty) Construction Project (Grant No. TJYXZDXK-3-003A).
Abstract: For decades, the central dogma of oncology has been that a cancer's identity is inextricably linked to its anatomical origin. This principle underpins the entire diagnostic and therapeutic framework, from histology-based classification to site-specific treatment guidelines. Yet this framework fails catastrophically for a substantial population of patients diagnosed with cancer of unknown primary (CUP). These patients present with metastatic disease, yet their primary tumors remain elusive despite exhaustive clinical workup [1]. CUP, accounting for 1%-3% of all cancer diagnoses, is an enigma with devastating consequences; the median overall survival is only 2-12 months [2-4]. The inability to pinpoint an origin forces clinicians to rely on broad-spectrum empirical chemotherapy, such as taxane-carboplatin regimens, which have limited efficacy and exclude patients from the promise of targeted therapies and clinical trials [5]. CUP is not only a diagnostic challenge but also an indictment of the siloed approach to understanding malignancy: this cancer highlights the limitations of origin-based diagnostic frameworks. However, the confluence of high-dimensional biological data and advanced artificial intelligence (AI) is now poised to address this long-standing diagnostic limitation and to herald a new era not only for CUP but for oncology as a whole (Figure 1).
Abstract: High-throughput transcriptomics has evolved from bulk RNA-seq to single-cell and spatial profiling, yet its clinical translation still depends on effective integration across diverse omics and data modalities. Emerging foundation models and multimodal learning frameworks are enabling scalable and transferable representations of cellular states, while advances in interpretability and real-world data integration are bridging the gap between discovery and clinical application. This paper outlines a concise roadmap for AI-driven, transcriptome-centered multi-omics integration in precision medicine (Figure 1).
Funding: Supported in part by the National Natural Science Foundation of China under Grants 52475102 and 52205101, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515240021, in part by the Young Talent Support Project of Guangzhou Association for Science and Technology (QT-2024-28), and in part by the Youth Development Initiative of Guangdong Association for Science and Technology (SKXRC2025254).
Abstract: To ensure the safe and stable operation of rotating machinery, intelligent fault diagnosis methods hold significant research value. However, existing diagnostic approaches largely rely on manual feature extraction and expert experience, which limits their adaptability under variable operating conditions and strong noise, severely affecting the generalization capability of diagnostic models. To address this issue, this study proposes a multimodal fusion fault diagnosis framework based on Mel spectrograms and automated machine learning (AutoML). The framework first extracts fault-sensitive Mel time-frequency features from acoustic signals and fuses them with statistical features of vibration signals to construct complementary fault representations. On this basis, automated machine learning techniques are introduced to enable end-to-end diagnostic workflow construction and optimal model configuration. Finally, diagnostic decisions are made by automatically integrating the predictions of multiple high-performance base models. Experimental results on a centrifugal pump vibration and acoustic dataset demonstrate that the proposed framework achieves high diagnostic accuracy under noise-free conditions and maintains strong robustness under noisy interference, validating its efficiency, scalability, and practical value for rotating machinery fault diagnosis.
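The feature-level pairing described above can be sketched briefly: Mel features summarize the acoustic channel, simple statistics (RMS, kurtosis, crest factor) summarize the synchronized vibration channel, and the concatenation feeds the AutoML stage. The statistic set, dimensions, and sampling rate below are illustrative assumptions; librosa is used for the Mel spectrogram.

```python
# Minimal sketch of acoustic Mel features fused with vibration statistics.
import numpy as np
import librosa

def acoustic_mel_features(audio, sr=16000, n_mels=64):
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel.mean(axis=1)                  # average over time -> (n_mels,)

def vibration_statistics(vib):
    rms = np.sqrt(np.mean(vib ** 2))
    kurtosis = np.mean((vib - vib.mean()) ** 4) / (vib.std() ** 4 + 1e-12)
    crest = np.max(np.abs(vib)) / (rms + 1e-12)  # peak-to-RMS ratio
    return np.array([rms, kurtosis, crest])

audio = np.random.randn(16000).astype(np.float32)  # 1 s of toy acoustic signal
vib = np.random.randn(16000).astype(np.float32)    # synchronized vibration
fused = np.concatenate([acoustic_mel_features(audio), vibration_statistics(vib)])
print(fused.shape)  # (67,) feature vector handed to the AutoML stage
```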
Abstract: Deep learning-based methods have shown great potential in intelligent bearing fault diagnosis. However, most existing approaches suffer from the scarcity of labeled data, which often results in insufficient robustness under complex working conditions and a general lack of interpretability. To address these challenges, we propose a physics-informed multimodal fault diagnosis framework based on few-shot learning, which integrates a 2D time-frequency image encoder and a 1D vibration signal encoder. Specifically, we embed prior knowledge of multi-resolution analysis from signal processing into the model by designing a Laplace Wavelet Convolution (LWC) module, which enhances interpretability because wavelet coefficients naturally correspond to specific frequency and temporal structures. To further balance the guidance of physical priors with the flexibility of learnable representations, we introduce a parametric multi-kernel wavelet that employs channel-wise dynamic attention to adaptively select relevant wavelet bases, thereby improving feature expressiveness. Moreover, we develop a Mahalanobis-Prototype Joint Metric, which constructs more accurate and distribution-consistent decision boundaries under few-shot conditions. Comprehensive experiments on the Case Western Reserve University (CWRU) and Paderborn University (PU) bearing datasets demonstrate the superior effectiveness, robustness, and interpretability of the proposed approach compared with state-of-the-art baselines.
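The Laplace-wavelet convolution idea admits a compact sketch: instead of freely learned filters, each kernel is a damped sinusoid generated from a learnable center frequency and damping ratio, so every channel remains physically interpretable. The kernel length, initialization, and clamping ranges below are assumptions, not the paper's exact LWC module.

```python
# Minimal sketch of a physics-parameterized Laplace-wavelet conv layer.
import torch
import torch.nn as nn

class LaplaceWaveletConv(nn.Module):
    def __init__(self, out_channels=16, kernel_size=64, fs=12000.0):
        super().__init__()
        self.kernel_size = kernel_size
        self.fs = fs  # sampling frequency of the vibration signal (Hz)
        # Learnable physical parameters: center frequency (Hz), damping ratio.
        self.freq = nn.Parameter(torch.linspace(50.0, 2000.0, out_channels))
        self.zeta = nn.Parameter(torch.full((out_channels,), 0.05))

    def forward(self, x):  # x: (batch, 1, length)
        t = torch.arange(self.kernel_size, device=x.device) / self.fs
        w = 2 * torch.pi * self.freq.clamp(min=1.0).unsqueeze(1)  # (C, 1)
        zeta = self.zeta.clamp(1e-3, 0.5).unsqueeze(1)            # (C, 1)
        # Laplace wavelet: damped sinusoid, one filter per (freq, zeta) pair.
        kernels = torch.exp(-zeta / torch.sqrt(1 - zeta ** 2) * w * t) \
                  * torch.sin(w * t)
        kernels = kernels / (kernels.norm(dim=1, keepdim=True) + 1e-8)
        return torch.conv1d(x, kernels.unsqueeze(1))              # (batch, C, L')

x = torch.randn(2, 1, 1024)
print(LaplaceWaveletConv()(x).shape)  # torch.Size([2, 16, 961])
```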