Journal Articles
1,091 articles found
1. Recent progress on artificial intelligence-enhanced multimodal sensors integrated devices and systems (cited: 2)
Authors: Haihua Wang, Mingjian Zhou, Xiaolong Jia, Hualong Wei, Zhenjie Hu, Wei Li, Qiumeng Chen, Lei Wang. Journal of Semiconductors, 2025(1): 179-192.
Multimodal sensor fusion can make full use of the advantages of various sensors, make up for the shortcomings of any single sensor, achieve information verification or information security through information redundancy, and improve the reliability and safety of the system. Artificial intelligence (AI), referring to the simulation of human intelligence in machines programmed to think and learn like humans, represents a pivotal frontier in modern scientific research. With the continuous development and promotion of AI technology in the Sensor 4.0 age, multimodal sensor fusion is becoming increasingly intelligent and automated, and is expected to go further in the future. In this context, this review article takes a comprehensive look at recent progress on AI-enhanced multimodal sensors and their integrated devices and systems. Based on the concepts and principles of sensor technologies and AI algorithms, the theoretical underpinnings, technological breakthroughs, and pragmatic applications of AI-enhanced multimodal sensors in fields such as robotics, healthcare, and environmental monitoring are highlighted. Through a comparative study of dual/tri-modal sensors with and without AI technologies (especially machine learning and deep learning), AI-enhanced multimodal sensors demonstrate the potential of AI to improve sensor performance, data processing, and decision-making capabilities. Furthermore, the review analyzes the challenges and opportunities afforded by AI-enhanced multimodal sensors, and offers a prospective outlook on forthcoming advancements.
Keywords: sensor; multimodal sensors; machine learning; deep learning; intelligent system
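As an illustration of the feature-level fusion this abstract describes, here is a minimal sketch; the modality names, dimensions, and equal weighting are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fuse_features(temp_feat, press_feat, w_temp=0.5, w_press=0.5):
    """Toy weighted feature-level fusion of two sensor modalities.

    Each modality is L2-normalised before a weighted concatenation,
    so neither sensor dominates the fused vector purely by scale.
    (Modality names and weights are illustrative assumptions.)
    """
    t = temp_feat / (np.linalg.norm(temp_feat) + 1e-12)
    p = press_feat / (np.linalg.norm(press_feat) + 1e-12)
    return np.concatenate([w_temp * t, w_press * p])

# Two synthetic modality readings (e.g., temperature and pressure channels)
temperature = np.array([2.0, 0.0, 0.0])
pressure = np.array([0.0, 3.0, 4.0])
fused = fuse_features(temperature, pressure)
```

Normalising each modality before concatenation is one simple way to realise the "make up for the shortcomings of a single sensor" idea: a large-magnitude sensor cannot drown out a weaker one.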
2. Medical multimodal large language models: A systematic review (cited: 1)
Authors: Yuan Hu, Chenhan Xu, Bo Lin, Weibin Yang, Yuan Yan Tang. Intelligent Oncology, 2025(4): 308-325.
The rapid advancement of artificial intelligence (AI) has ushered in a new era of medical multimodal large language models (MLLMs), which integrate diverse data modalities such as text, imaging, physiological signals, and genomics to enhance clinical decision-making. This systematic review explores the core methodologies and applied research frontiers of medical MLLMs, focusing on their architecture, training methods, evaluation techniques, and applications. We highlight the transformative potential of MLLMs in achieving cross-modal semantic alignment, medical knowledge integration, and robust clinical reasoning. Despite their promise, challenges such as data heterogeneity, hallucination, and computational efficiency persist. By reviewing state-of-the-art solutions and future directions, this paper provides a comprehensive technical guide for developing reliable and interpretable medical MLLMs, ultimately aiming to bridge the gap between AI and clinical practice.
Keywords: multimodal large language model; hallucination; medical multimodal dataset; clinical evaluation
3. Performance vs. Complexity Comparative Analysis of Multimodal Bilinear Pooling Fusion Approaches for Deep Learning-Based Visual Arabic-Question Answering Systems
Authors: Sarah M. Kamel, Mai A. Fadel, Lamiaa Elrefaei, Shimaa I. Hassan. Computer Modeling in Engineering & Sciences, 2025(4): 373-411.
Visual question answering (VQA) is a multimodal task involving a deep understanding of the image scene and the question's meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between model complexity and overall model performance. Some fusion techniques can significantly increase model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce model complexity and improve model performance in this class of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques improved the VQA model's performance, reaching a best performance of 89.25%. Further, experiments have shown that the number of answers in the developed VQA system is a critical factor affecting the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique showed the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
Keywords: Arabic-VQA; deep learning-based VQA; deep multimodal information fusion; multimodal representation learning; VQA of yes/no questions; VQA model complexity; VQA model performance; performance-complexity trade-off
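For readers unfamiliar with bilinear pooling, a low-rank variant (in the spirit of techniques such as MLB) can be sketched as follows; the dimensions, random projection matrices, and tanh activations are illustrative assumptions, not the paper's MLPB implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_bilinear(img_feat, q_feat, U, V, P):
    """Low-rank bilinear pooling of an image and a question feature.

    Instead of the full outer product of the two features (which is
    enormous), both are projected to a shared rank-d space, combined
    with an element-wise product, then projected to the output size.
    """
    h = np.tanh(U @ img_feat) * np.tanh(V @ q_feat)  # shared rank-d space
    return P @ h                                      # output logits

img_dim, q_dim, rank, out_dim = 2048, 512, 64, 10
U = rng.normal(size=(rank, img_dim)) * 0.01
V = rng.normal(size=(rank, q_dim)) * 0.01
P = rng.normal(size=(out_dim, rank)) * 0.1

img = rng.normal(size=img_dim)      # stand-in for a ResNet-152 image feature
question = rng.normal(size=q_dim)   # stand-in for a GRU question encoding
logits = low_rank_bilinear(img, question, U, V, P)
```

The rank d is exactly the complexity/performance knob the abstract analyses: a full bilinear form would need an img_dim × q_dim × out_dim tensor, while the low-rank form needs only three small matrices.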
4. Human Behaviour Classification in Emergency Situations Using Machine Learning with Multimodal Data: A Systematic Review (2020-2025)
Authors: Mirza Murad Baig, Muhammad Rehan Faheem, Lal Khan, Hannan Adeel, Syed Asim Ali Shah. Computer Modeling in Engineering & Sciences, 2025(12): 2895-2935.
As urban areas and populations grow and the climate continues to change, the demand for better emergency response systems has become more important than ever. Human Behaviour Classification (HBC) systems have started to play a vital role by analysing data from different sources to detect signs of emergencies. These systems are being used in many critical areas like healthcare, public safety, and disaster management to improve response time and to prepare ahead of time. But detecting human behaviour in such stressful conditions is not simple; it often comes with noisy data, missing information, and the need to react in real time. This review takes a deeper look at HBC research published between 2020 and 2025 and aims to answer five specific research questions. These questions cover the types of emergencies discussed in the literature, the datasets and sensors used, the effectiveness of machine learning (ML) and deep learning (DL) models, and the limitations that still exist in this field. We explored 120 papers that used different types of datasets: some were based on sensor data, others on social media, and a few used hybrid approaches. Commonly used models included CNNs, LSTMs, and reinforcement learning methods to identify behaviours. Though a lot of progress has been made, the review found ongoing issues in combining sensors properly, reacting fast enough, and using more diverse datasets. Overall, from the findings we observed, the focus should be on building systems that use multiple sensors together, gather real-time data on a large scale, and produce results that are easier to interpret. Proper attention to privacy and ethical concerns is needed as well.
Keywords: human behaviour classification (HBC); public safety; multimodal datasets; privacy concerns; missing information; multi-sensor integration; healthcare
5. A Special Issue: “Co-optimization and mechanism design of multimodal energy systems under carbon constraints”
Authors: Lin Cheng, Xiaojun Wan. Global Energy Interconnection, 2025(2): I0002-I0003.
Against the backdrop of active global responses to climate change and the accelerated green and low-carbon energy transition, the co-optimization and innovative mechanism design of multimodal energy systems have become a significant instrument for propelling the energy revolution and ensuring energy security. Under increasingly stringent carbon emission constraints, how to achieve multi-dimensional improvements in energy utilization efficiency, renewable energy accommodation levels, and system economics, through the intelligent coupling of diverse energy carriers such as electricity, heat, natural gas, and hydrogen and the effective application of market-based instruments like carbon trading and demand response, constitutes a critical scientific and engineering challenge demanding urgent solutions.
Keywords: multimodal energy systems; renewable energy accommodation; energy utilization efficiency; co-optimization; carbon constraints; climate change; carbon emission constraints; mechanism design
6. Multimodal artificial intelligence system for detecting a small esophageal high-grade squamous intraepithelial neoplasia: A case report
Authors: Yang Zhou, Rui-De Liu, Hui Gong, Xiang-Lei Yuan, Bing Hu, Zhi-Yin Huang. World Journal of Gastrointestinal Endoscopy, 2025(1): 61-65.
BACKGROUND: Recent advancements in artificial intelligence (AI) have significantly enhanced the capabilities of endoscopic-assisted diagnosis for gastrointestinal diseases. AI has shown great promise in clinical practice, particularly for diagnostic support, offering real-time insights into complex conditions such as esophageal squamous cell carcinoma. CASE SUMMARY: In this study, we introduce a multimodal AI system that successfully identified and delineated a small, flat carcinoma during esophagogastroduodenoscopy, highlighting its potential for early detection of malignancies. The lesion was confirmed as high-grade squamous intraepithelial neoplasia, with pathology results supporting the AI system's accuracy. The multimodal AI system offers an integrated solution that provides real-time, accurate diagnostic information directly within the endoscopic device interface, allowing single-monitor use without disrupting the endoscopist's workflow. CONCLUSION: This work underscores the transformative potential of AI to enhance endoscopic diagnosis by enabling earlier, more accurate interventions.
Keywords: artificial intelligence; multimodal artificial intelligence system; esophageal squamous cell carcinoma; high-grade intraepithelial neoplasia; case report
7. A Root Cause Analysis Framework for Microservice Systems with Multimodal Data
Authors: LI Yingke, HAN Jing, SUN Yongqian, SHI Binpeng, GONG Zican. ZTE Communications, 2025(4): 110-119.
In recent years, microservice architecture has gained increasing popularity. However, due to the complex and dynamically changing nature of microservice systems, failure detection has become more challenging. Traditional root cause analysis methods mostly rely on a single modality of data, which is insufficient to cover all failure information. Existing multimodal methods require collecting high-quality labeled samples and often face challenges in classifying unknown failure categories. To address these challenges, this paper proposes a root cause analysis framework based on a masked graph autoencoder (GAE). The main process involves feature extraction, feature dimensionality reduction based on the GAE, and online clustering combined with expert input. The method is experimentally evaluated on two public datasets and compared with two baseline methods, demonstrating significant advantages even with only 16% labeled samples.
Keywords: root cause analysis; multimodal data; self-supervised learning; online clustering
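The online-clustering step described in the abstract can be sketched with a toy sequential clusterer; the distance threshold and running-mean centroid update below are illustrative assumptions, not the paper's GAE-based method.

```python
import numpy as np

def online_cluster(embeddings, threshold=1.0):
    """Toy online clustering of failure embeddings.

    Each embedding joins the nearest existing centroid, or opens a
    new cluster when every centroid is farther than `threshold` (a
    stand-in for flagging a potentially unknown failure category
    for expert review). Centroids are updated as running means.
    """
    centroids, counts, labels = [], [], []
    for e in embeddings:
        if centroids:
            dists = [np.linalg.norm(e - c) for c in centroids]
            j = int(np.argmin(dists))
        if not centroids or dists[j] > threshold:
            centroids.append(e.astype(float).copy())  # open a new cluster
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[j] += 1
            centroids[j] += (e - centroids[j]) / counts[j]  # running mean
            labels.append(j)
    return labels, centroids

# Two tight groups far apart: expect two clusters
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = online_cluster(points, threshold=1.0)
```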
8. A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions
Authors: A-Seong Moon, Haesung Kim, Ye-Chan Park, Jaesung Lee. Computers, Materials & Continua, 2026(5): 1-42.
Multimodal emotion recognition (MER) has emerged as a key research area for enabling human-centered artificial intelligence, supported by rapid progress in vision, audio, language, and physiological modeling. Existing approaches integrate heterogeneous affective cues through diverse embedding strategies and fusion mechanisms, yet the field remains fragmented due to differences in feature alignment, temporal synchronization, modality reliability, and robustness to noise or missing inputs. This survey provides a comprehensive analysis of MER research from 2021 to 2025, consolidating advances in modality-specific representation learning, cross-modal feature construction, and early, late, and hybrid fusion paradigms. We systematically review visual, acoustic, textual, and sensor-based embeddings, highlighting how pre-trained encoders, self-supervised learning, and large language models have reshaped the representational foundations of MER. We further categorize fusion strategies by interaction depth and architectural design, examining how attention mechanisms, cross-modal transformers, adaptive gating, and multimodal large language models redefine the integration of affective signals. Finally, we summarize major benchmark datasets and evaluation metrics and discuss emerging challenges related to scalability, generalization, and interpretability. This survey aims to provide a unified perspective on multimodal fusion for emotion recognition and to guide future research toward more coherent and generalizable multimodal affective intelligence.
Keywords: multimodal emotion recognition; multimodal learning; cross-modal learning; fusion strategies; representation learning
9. BCAM-Net: A Bidirectional Cross-Attention Multimodal Network for IoT Spectrum Sensing under Generalized Gaussian Noise
Authors: Yuzhou Han, Zhuoran Li, Ahmad Gendia, Teruji Ide, Osamu Muta. Computers, Materials & Continua, 2026(5): 272-297.
Spectrum sensing is an indispensable core part of cognitive radio dynamic spectrum access (DSA) and a key approach to alleviating spectrum scarcity in the Internet of Things (IoT). The key issue in practical IoT networks is robust sensing under the coexistence of low signal-to-noise ratios (SNRs) and non-Gaussian impulsive noise, where observations may be distorted differently across feature modalities, making conventional fusion unstable and degrading detection reliability. To address this challenge, the generalized Gaussian distribution (GGD) is adopted as the noise model, and a multimodal fusion framework termed BCAM-Net (bidirectional cross-attention multimodal network) is proposed. BCAM-Net adopts a parallel dual-branch architecture: a time-frequency branch that leverages the continuous wavelet transform (CWT) to extract time-frequency representations, and a temporal branch that learns long-range dependencies from raw signals. BCAM-Net utilizes a bidirectional cross-attention mechanism to achieve deep alignment and mutual calibration of temporal and time-frequency features, generating a fused representation that is highly robust to complex noise. Simulation results show that, under GGD noise with shape parameter β = 0.5, BCAM-Net achieves high detection probabilities in the low-SNR regime and outperforms representative baselines. At a false alarm probability Pf = 0.1 and an SNR of −14 dB, it attains a detection probability of 0.9020, exceeding the CNN-Transformer, WT-ResNet, TFCFN, and conventional CNN benchmarks by 5.75%, 6.98%, 33.3%, and 21.1%, respectively. These results indicate that BCAM-Net can effectively improve spectrum sensing performance in low-SNR impulsive-noise scenarios and provides a lightweight, high-performance solution for practical cognitive radio spectrum sensing.
Keywords: cognitive radio; spectrum sensing; IoT; deep learning; bidirectional cross-attention; multimodal fusion
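A minimal single-head version of the bidirectional cross-attention idea can be sketched as follows; the token counts, shared feature dimension, residual fusion, and absence of learned projection matrices are simplifying assumptions, not BCAM-Net's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: `queries` attend over `keys_values`."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def bidirectional_cross_attention(temporal, timefreq):
    """Each branch is refined by attending to the other, then fused."""
    t2f = cross_attend(temporal, timefreq)   # temporal queries, TF context
    f2t = cross_attend(timefreq, temporal)   # TF queries, temporal context
    # Residual connections keep each branch's own information
    return np.concatenate([temporal + t2f, timefreq + f2t], axis=0)

rng = np.random.default_rng(1)
temporal_feats = rng.normal(size=(16, 32))  # 16 temporal tokens, dim 32
timefreq_feats = rng.normal(size=(8, 32))   # 8 time-frequency tokens (e.g., CWT patches)
fused = bidirectional_cross_attention(temporal_feats, timefreq_feats)
```

Running attention in both directions is what lets the two branches mutually calibrate: each modality's tokens are re-expressed in terms of the other's context before fusion.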
10. Multimodal clinical parameters-based immune status associated with the prognosis in patients with hepatocellular carcinoma
Authors: Yu-Zhou Zhang, Yuan-Ze Tang, Yun-Xuan He, Shu-Tong Pan, Hao-Cheng Dai, Yu Liu, Hai-Feng Zhou. World Journal of Gastrointestinal Oncology, 2026(1): 75-91.
Hepatocellular carcinoma presents with three distinct immune phenotypes, immune-desert, immune-excluded, and immune-inflamed, indicating various treatment responses and prognostic outcomes. Although multi-omics parameters accurately reflect immune status, their clinical application is still restricted by expensive and less accessible assays. A comprehensive evaluation framework based on easy-to-obtain multimodal clinical parameters is urgently required, incorporating clinical features to establish baseline patient profiles and disease staging; routine blood tests assessing systemic metabolic and functional status; immune cell subsets quantifying subcluster dynamics; imaging features delineating tumor morphology, spatial configuration, and perilesional anatomical relationships; and immunohistochemical markers providing qualitative and quantitative detection of tumor antigens at the cellular and molecular level. This integrated phenomic approach aims to improve prognostic stratification and clinical decision-making in hepatocellular carcinoma management conveniently and practically.
Keywords: hepatocellular carcinoma; immune status; phenotype; multimodal parameters; prognosis
11. Flexible Tactile Sensing Systems: Challenges in Theoretical Research Transferring to Practical Applications
Authors: Zhiyu Yao, Wenjie Wu, Fengxian Gao, Min Gong, Liang Zhang, Dongrui Wang, Baochun Guo, Liqun Zhang, Xiang Lin. Nano-Micro Letters, 2026(2): 19-87.
Since the first design of tactile sensors was proposed by Harmon in 1982, tactile sensors have evolved through four key phases: industrial applications (1980s, basic pressure detection), miniaturization via MEMS (1990s), flexible electronics (2010s, stretchable materials), and intelligent systems (2020s-present, AI-driven multimodal sensing). With innovations in materials, processing techniques, and multimodal fusion of stimuli, the application of tactile sensors has been continuously expanding to diverse areas, including but not limited to medical care, aerospace, sports, and intelligent robots. Currently, researchers are dedicated to developing tactile sensors with emerging mechanisms and structures, pursuing high sensitivity, high resolution, and multimodal characteristics, and further constructing tactile systems that imitate and approach the performance of human organs. However, challenges in connecting theoretical research with practical applications remain significant. There is a lack of comprehensive understanding of the state of the art of such knowledge transfer from academic work to technical products. Scaled-up production of laboratory materials faces serious challenges such as high costs, small scale, and inconsistent quality. Ambient factors, such as temperature, humidity, and electromagnetic interference, also impair signal reliability. Moreover, tactile sensors must operate across a wide pressure range (0.1 kPa to several or even dozens of MPa) to meet diverse application needs. Meanwhile, existing algorithms, data models, and sensing systems commonly show insufficient precision as well as limited robustness in data processing, and there is a real gap between the designed and the required system response speed. In this review, oriented by the design requirements of intelligent tactile sensing systems, we summarize the common sensing mechanisms, inspired structures, key performance, and optimizing strategies, followed by a brief overview of recent advances in system integration and algorithm implementation and the possible roadmap of future development of tactile sensors, providing forward-looking as well as critical discussions of the future industrial applications of flexible tactile sensors.
Keywords: tactile sensation; flexibility; multimodal system integration; robotic haptics
12. SparseMoE-MFN: A Sparse Attention and Mixture-of-Experts Framework for Multimodal Fake News Detection on Social Media
Authors: Yuechuan Zhang, Mingshu Zhang, Bin Wei, Hongyu Jin, Yaxuan Wang. Computers, Materials & Continua, 2026(5): 1646-1669.
Detecting fake news in multimodal and multilingual social media environments is challenging due to inherent noise, inter-modal imbalance, computational bottlenecks, and semantic ambiguity. To address these issues, we propose SparseMoE-MFN, a novel unified framework that integrates sparse attention with a sparse-activated Mixture-of-Experts (MoE) architecture. This framework aims to enhance the efficiency, inferential depth, and interpretability of multimodal fake news detection. SparseMoE-MFN leverages LLaVA-v1.6-Mistral-7B-HF for efficient visual encoding and Qwen/Qwen2-7B for text processing. The sparse attention module adaptively filters irrelevant tokens and focuses on key regions, reducing computational costs and noise. The sparse MoE module dynamically routes inputs to specialized experts (visual, language, cross-modal alignment) based on content heterogeneity. This expert specialization design boosts computational efficiency and semantic adaptability, enabling precise processing of complex content and improving performance on ambiguous categories. Evaluated on the large-scale, multilingual MR2 dataset, SparseMoE-MFN achieves state-of-the-art performance. It obtains an accuracy of 86.7% and a macro-averaged F1 score of 0.859, outperforming strong baselines like MiniGPT-4 by 3.4% and 3.2%, respectively. Notably, it shows significant advantages in the "unverified" category. Furthermore, SparseMoE-MFN demonstrates superior computational efficiency, with an average inference latency of 89.1 ms and 95.4 GFLOPs, substantially lower than existing models. Ablation studies and visualization analyses confirm the effectiveness of both the sparse attention and sparse MoE components in improving accuracy, generalization, and efficiency.
Keywords: fake news detection; multimodal; sparse attention; mixture-of-experts; interpretability; computational efficiency
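Top-k sparse routing, the core mechanism of any sparse-activated MoE layer, can be sketched as follows; the toy linear experts and gate are illustrative assumptions, not SparseMoE-MFN's modules.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_moe(x, gate_W, experts, k=2):
    """Sparse mixture-of-experts: route the input to the top-k experts only.

    `experts` is a list of callables; `gate_W` maps the input to one
    logit per expert. Only the k best-scoring experts run, and their
    outputs are combined with gate weights renormalised over that set,
    which is what makes the layer cheap despite many experts.
    """
    logits = gate_W @ x
    topk = np.argsort(logits)[-k:]       # indices of the k largest gate logits
    weights = softmax(logits[topk])      # renormalise over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

rng = np.random.default_rng(2)
dim, n_experts = 8, 4
gate_W = rng.normal(size=(n_experts, dim))
# Toy experts: each is a different linear map (visual / language / alignment roles)
expert_mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda v, M=M: M @ v for M in expert_mats]

x = rng.normal(size=dim)
y = sparse_moe(x, gate_W, experts, k=2)
```

With k=2 of 4 experts active, only half the expert compute runs per input, which mirrors the latency/FLOPs savings the abstract reports at a much larger scale.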
13. Transformation of Verbal Descriptions of Process Flows into Business Process Modelling and Notation Models Using Multimodal Artificial Intelligence: Application in Justice
Authors: Silvia Alayón, Carlos Martín, Jesús Torres, Manuel Bacallado, Rosa Aguilar, Guzmán Savirón. Computer Modeling in Engineering & Sciences, 2026(2): 870-892.
Business Process Modelling (BPM) is essential for analyzing, improving, and automating the flow of information within organizations, but traditional approaches based on manual interpretation are slow, error-prone, and require a high level of expertise. This article proposes an innovative alternative that overcomes these limitations by automatically generating comprehensive Business Process Modelling and Notation (BPMN) diagrams solely from verbal descriptions of the processes to be modeled, utilizing Large Language Models (LLMs) and multimodal Artificial Intelligence (AI). Experimental results, based on video recordings of process explanations provided by an expert from an organization (in this case, the Commercial Courts of a public justice administration), demonstrate that the proposed methodology successfully enables the automatic generation of complete and accurate BPMN diagrams, leading to significant improvements in the speed, accuracy, and accessibility of process modeling. This research makes a substantial contribution to the field of business process modeling, as its methodology is groundbreaking in its use of LLM and multimodal AI capabilities to handle different types of source material (text and video), combining several tools to minimize the number of queries and reduce the complexity of the prompts required for the automatic generation of successful BPMN diagrams.
Keywords: process modelling; verbal description; BPMN; LLM; multimodal AI
14. Multimodal medical imaging AI for breast cancer diagnosis: A comprehensive review
Authors: Ting-Ruen Wei, Yuling Yan. Intelligent Oncology, 2026(1): 40-50.
Traditional artificial intelligence (AI)-based methods for breast cancer diagnosis often rely on a single modality, such as ultrasound images. With the rise of multimodal approaches, multiple data sources, including imaging from diverse medical modalities, structured clinical information, and unstructured medical reports, are increasingly integrated to provide richer and more informative signals for model training. This survey reviews the data modalities employed in AI-based breast cancer research, examines common multimodal combinations and fusion strategies, and discusses their applications across clinical tasks such as diagnosis, treatment planning, and outcome prediction. By consolidating current literature and identifying critical gaps, this survey aims to guide future research toward the development of reliable, clinically relevant multimodal AI systems for use in breast cancer management.
Keywords: breast cancer; artificial intelligence; machine learning; deep learning; multimodal
15. AVCLNet: Multimodal Multispeaker Tracking Network Using Audio-Visual Contrastive Learning
Authors: Yihan Li, Yidi Li, Zhenhuan Xu, Hao Guo, Mengyuan Liu, Weiwei Wan. CAAI Transactions on Intelligence Technology, 2026(1): 238-255.
Audio-visual speaker tracking aims to determine the locations of multiple speakers in a scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and robustness of speaker tracking. However, in complex multispeaker tracking scenarios, critical challenges such as cross-modal feature discrepancy, weak sound source localisation ambiguity, and frequent identity switch errors remain unresolved, which severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable tracking trajectories. To this end, this paper proposes a multimodal multispeaker tracking network using audio-visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio-visual contrastive learning, which facilitates cross-modal feature alignment, mitigates cross-modal feature bias, and enhances identity-consistent representations. In the audio-visual measurement stage, we design a vision-guided weighted enhancement method for weak sound sources, which leverages visual cues to establish cross-modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy is introduced by combining 2D and 3D spatial geometric information, reducing frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state-of-the-art methods, demonstrating superior robustness in multispeaker scenarios.
Keywords: computer vision; machine perception; multimodal approaches; pattern recognition; video signal processing
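Audio-visual contrastive objectives are typically InfoNCE-style losses over paired embeddings; the sketch below is a generic symmetric version with illustrative batch size and temperature, not AVCLNet's exact loss.

```python
import numpy as np

def info_nce(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matching audio/visual pairs (same speaker, same frame) sit on the
    diagonal of the cosine-similarity matrix; the loss pulls them
    together and pushes mismatched pairs apart, in both directions.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                 # (B, B) similarity matrix
    idx = np.arange(len(a))
    lp_av = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_va = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-lp_av[idx, idx].mean() - lp_va[idx, idx].mean()) / 2

rng = np.random.default_rng(3)
visual = rng.normal(size=(4, 16))
# Nearly-matched pairs should score a much lower loss than random pairs
aligned_loss = info_nce(visual + 0.01 * rng.normal(size=(4, 16)), visual)
random_loss = info_nce(rng.normal(size=(4, 16)), visual)
```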
16. Multimodal Trajectory Generation for Robotic Motion Planning Using Transformer-Based Fusion and Adversarial Learning
Authors: Shtwai Alsubai, Ahmad Almadhor, Abdullah Al Hejaili, Najib Ben Aoun, Tahani Alsubait, Vincent Karovic. Computer Modeling in Engineering & Sciences, 2026(2): 848-869.
In Human-Robot Interaction (HRI), generating robot trajectories that accurately reflect user intentions while ensuring physical realism remains challenging, especially in unstructured environments. In this study, we develop a multimodal framework that integrates symbolic task reasoning with continuous trajectory generation. The approach employs transformer models and adversarial training to map high-level intent to robotic motion. Information from multiple data sources, such as voice traits, hand and body keypoints, visual observations, and recorded paths, is integrated simultaneously. These signals are mapped into a shared representation that supports interpretable reasoning while enabling smooth and realistic motion generation. Based on this design, two different learning strategies are investigated. The first creates grammar-constrained Linear Temporal Logic (LTL) expressions from multimodal human inputs; these expressions are subsequently decoded into robot trajectories. The second generates trajectories directly from symbolic intent and linguistic data, bypassing an intermediate logical representation. Transformer encoders combine multiple types of information, and autoregressive transformer decoders generate motion sequences. Adding smoothness and speed limits during training increases the likelihood of physical feasibility. To improve the realism and stability of the generated trajectories during training, an adversarial discriminator is also included to guide them toward the distribution of actual robot motion. Tests on the NATSGLD dataset indicate that the complete system exhibits stable training behaviour and performance. In normalised coordinates, the logic-based pipeline has an Average Displacement Error (ADE) of 0.040 and a Final Displacement Error (FDE) of 0.036. The adversarial generator performs substantially better, reducing ADE to 0.021 and FDE to 0.018. Visual examination confirms that the generated trajectories closely align with observed motion patterns while preserving smooth temporal dynamics.
Keywords: multimodal trajectory generation; robotic motion planning; transformer networks; sensor fusion; reinforcement learning; generative adversarial networks
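The ADE and FDE figures quoted in the abstract above are standard trajectory-prediction metrics: ADE averages the Euclidean error over every waypoint, while FDE measures only the error at the final waypoint. A minimal sketch of how they are computed, assuming trajectories are given as NumPy arrays of (x, y) points in normalised coordinates (the array shapes and the toy data here are illustrative, not from the paper):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error between two trajectories.

    pred, gt: arrays of shape (T, 2) -- sequences of (x, y) waypoints.
    ADE averages the per-step Euclidean error; FDE keeps only the
    error at the last waypoint.
    """
    errors = np.linalg.norm(pred - gt, axis=-1)  # per-step distances, shape (T,)
    return errors.mean(), errors[-1]

# Toy check: a 5-point straight-line path, predicted with a constant
# 0.03 offset along x, so both ADE and FDE come out to 0.03.
gt = np.stack([np.linspace(0.0, 1.0, 5), np.zeros(5)], axis=1)
pred = gt + np.array([0.03, 0.0])
ade, fde = ade_fde(pred, gt)
print(round(ade, 3), round(fde, 3))  # 0.03 0.03
```

On a benchmark, these values would be averaged over all test trajectories, which is how single summary numbers such as ADE 0.021 arise.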
Multimodal artificial intelligence integrates imaging,endoscopic,and omics data for intelligent decision-making in individualized gastrointestinal tumor treatment
17
Authors: Hui Nian, Yi-Bin Wu, Yu Bai, Zhi-Long Zhang, Xiao-Huang Tu, Qi-Zhi Liu, De-Hua Zhou, Qian-Cheng Du. Artificial Intelligence in Gastroenterology, 2026, Issue 1, pp. 1-19 (19 pages)
Gastrointestinal tumors require personalized treatment strategies due to their heterogeneity and complexity. Multimodal artificial intelligence (AI) addresses this challenge by integrating diverse data sources, including computed tomography (CT), magnetic resonance imaging (MRI), endoscopic imaging, and genomic profiles, to enable intelligent decision-making for individualized therapy. This approach leverages AI algorithms to fuse imaging, endoscopic, and omics data, facilitating comprehensive characterization of tumor biology, prediction of treatment response, and optimization of therapeutic strategies. By combining CT and MRI for structural assessment, endoscopic data for real-time visual inspection, and genomic information for molecular profiling, multimodal AI enhances the accuracy of patient stratification and treatment personalization. The clinical implementation of this technology demonstrates potential for improving patient outcomes, advancing precision oncology, and supporting individualized care in gastrointestinal cancers. Ultimately, multimodal AI serves as a transformative tool in oncology, bridging data integration with clinical application to effectively tailor therapies.
Keywords: multimodal artificial intelligence; gastrointestinal tumors; individualized therapy; intelligent diagnosis; treatment optimization; prognostic prediction; data fusion; deep learning; precision medicine
Engineering stimuli-responsive block copolymers for multimodal bioimaging
18
Authors: Lizhuang Zhong, Ming Liu, Shilong Su, Dongxin Zeng, Jing Hu, Zhiqian Guo. Chinese Chemical Letters, 2026, Issue 1, pp. 116-124 (9 pages)
The diagnostic efficacy of contemporary bioimaging technologies remains constrained by inherent limitations of conventional imaging agents, including suboptimal sensitivity, off-target biodistribution, and inherent cytotoxicity. These limitations have catalyzed the development of intelligent stimuli-responsive block copolymer-based bioimaging agents, which are engineered to dynamically respond to endogenous biochemical cues (e.g., pH gradients, redox potential, enzyme activity, hypoxic environments) or exogenous physical triggers (e.g., photoirradiation, thermal gradients, ultrasound (US)/magnetic stimuli). Through spatiotemporally controlled structural transformations, stimuli-responsive block copolymers enable precise contrast targeting, activatable signal amplification, and theranostic integration, thereby substantially enhancing the signal-to-noise ratio of bioimaging and diagnostic specificity. Hence, this mini-review systematically examines molecular engineering principles for designing pH-, redox-, enzyme-, light-, thermo-, and US/magnetic-responsive polymers, with emphasis on structure-property relationships governing imaging performance modulation. Furthermore, we critically analyze emerging strategies for optical imaging, US synergies, and magnetic resonance imaging (MRI). Multimodal bioimaging is also elaborated, as it could overcome the inherent trade-offs between resolution, penetration depth, and functional specificity in single-modal approaches. By elucidating mechanistic insights and translational challenges, this mini-review aims to establish a design framework for stimuli-responsive block copolymer-based high-fidelity bioimaging agents and to accelerate their clinical translation in precise diagnosis and therapy.
Keywords: stimuli-responsive; block copolymers; molecular engineering; multimodal bioimaging; diagnosis and therapy
LLM-Powered Multimodal Reasoning for Fake News Detection
19
Authors: Md. Ahsan Habib, Md. Anwar Hussen Wadud, M. F. Mridha, Md. Jakir Hossen. Computers, Materials & Continua, 2026, Issue 4, pp. 1821-1864 (44 pages)
The problem of fake news detection (FND) is becoming increasingly important in the field of natural language processing (NLP) because of the rapid dissemination of misleading information on the web. Large language models (LLMs) such as GPT-4.0 excel in natural language understanding tasks but can still struggle to distinguish between fact and fiction, particularly when applied in the wild. However, a key challenge of existing FND methods is that they only consider unimodal data (e.g., images), while more detailed multimodal data (e.g., user behaviour, temporal dynamics) is neglected, and the latter is crucial for full-context understanding. To overcome these limitations, we introduce M3-FND (Multimodal Misinformation Mitigation for False News Detection), a novel methodological framework that integrates LLMs with multimodal data sources to perform context-aware veracity assessments. Our method proposes a hybrid system that combines image-text alignment, user credibility profiling, and temporal pattern recognition, which is also strengthened through a natural feedback loop that provides real-time feedback for correcting downstream errors. We use contextual reinforcement learning to schedule prompt updating and update the classifier threshold based on the latest multimodal input, which enables the model to better adapt to changing misinformation attack strategies. M3-FND is tested on three diverse datasets, FakeNewsNet, Twitter15, and Weibo, which contain both text and visual social media content. Experiments show that M3-FND significantly outperforms conventional and LLM-based baselines in terms of accuracy, F1-score, and AUC on all benchmarks. Our results indicate the importance of employing multimodal cues and adaptive learning for effective and timely detection of fake news.
Keywords: fake news detection; multimodal learning; large language models; prompt engineering; instruction tuning; reinforcement learning; misinformation mitigation
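The abstract above combines per-modality evidence (image-text alignment, user credibility, temporal patterns) into one veracity decision. A minimal late-fusion sketch of that general idea, not the paper's actual M3-FND architecture: each modality head is assumed to emit a fake-probability, and the scores are combined as weighted log-odds (the scores and weights below are hypothetical):

```python
import math

def fuse_scores(scores, weights):
    """Combine per-modality fake probabilities in log-odds space.

    Each probability p is mapped to its logit log(p / (1 - p)),
    the logits are mixed with the given weights, and the sum is
    passed through a sigmoid to return to probability space.
    """
    z = sum(w * math.log(p / (1 - p)) for p, w in zip(scores, weights))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical per-modality fake probabilities:
# text content, image-text alignment, user credibility, temporal pattern
scores = [0.9, 0.7, 0.6, 0.8]
weights = [0.4, 0.2, 0.2, 0.2]
p_fake = fuse_scores(scores, weights)
print(p_fake > 0.5)  # True -- flagged as likely fake
```

Fusing in log-odds space rather than averaging raw probabilities lets a confident modality dominate, and the weights give a natural place to encode learned per-modality reliability.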