Journal Articles
55 articles found
TGNet: Intelligent Identification of Thunderstorm Wind Gusts Using Multimodal Fusion (Cited: 2)
1
Authors: Xiaowen ZHANG, Yongguang ZHENG, Hengde ZHANG, Jie SHENG, Bingjian LU, Shuo FENG. Advances in Atmospheric Sciences, 2025, No. 1, pp. 146-164 (19 pages)
Thunderstorm wind gusts are small in scale, typically occurring within a range of a few kilometers. It is extremely challenging to monitor and forecast thunderstorm wind gusts using only automatic weather stations. Therefore, it is necessary to establish thunderstorm wind gust identification techniques based on multisource high-resolution observations. This paper introduces a new algorithm, called thunderstorm wind gust identification network (TGNet). It leverages multimodal feature fusion to fuse the temporal and spatial features of thunderstorm wind gust events. The shapelet transform is first used to extract the temporal features of wind speeds from automatic weather stations, which is aimed at distinguishing thunderstorm wind gusts from those caused by synoptic-scale systems or typhoons. Then, the encoder, structured upon the U-shaped network (U-Net) and incorporating recurrent residual convolutional blocks (R2U-Net), is employed to extract the corresponding spatial convective characteristics of satellite, radar, and lightning observations. Finally, by using the multimodal deep fusion module based on multi-head cross-attention, the temporal features of wind speed at each automatic weather station are incorporated into the spatial features to obtain 10-minutely classification of thunderstorm wind gusts. TGNet products have high accuracy, with a critical success index reaching 0.77. Compared with those of U-Net and R2U-Net, the false alarm rate of TGNet products decreases by 31.28% and 24.15%, respectively. The new algorithm provides grid products of thunderstorm wind gusts with a spatial resolution of 0.01°, updated every 10 minutes. The results are finer and more accurate, thereby helping to improve the accuracy of operational warnings for thunderstorm wind gusts.
Keywords: thunderstorm wind gusts; shapelet transform; multimodal deep feature fusion
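The fusion step described above, injecting per-station temporal wind-speed features into gridded spatial features through multi-head cross-attention, can be illustrated with a minimal PyTorch sketch. This is not the published TGNet code; the module name, feature dimension, number of heads, and tensor shapes are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class StationCrossAttentionFusion(nn.Module):
    """Hedged sketch: fuse per-station temporal features into spatial features
    sampled at the same station locations using multi-head cross-attention."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial_feat, temporal_feat):
        # spatial_feat:  (B, N, C) satellite/radar/lightning features at N stations
        # temporal_feat: (B, N, C) shapelet-based wind-speed features for the same stations
        fused, _ = self.attn(query=spatial_feat, key=temporal_feat, value=temporal_feat)
        return self.norm(spatial_feat + fused)   # residual connection

# toy usage with assumed shapes
m = StationCrossAttentionFusion()
out = m(torch.randn(2, 50, 128), torch.randn(2, 50, 128))  # -> (2, 50, 128)
```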
Multimodal fusion recognition for digital twin
2
Authors: Tianzhe Zhou, Xuguang Zhang, Bing Kang, Mingkai Chen. Digital Communications and Networks (SCIE, CSCD), 2024, No. 2, pp. 337-346 (10 pages)
The digital twin is the concept of transcending reality, which is the reverse feedback from the real physical space to the virtual digital space. People hold great prospects for this emerging technology. In order to realize the upgrading of the digital twin industrial chain, it is urgent to introduce more modalities, such as vision, haptics, hearing and smell, into the virtual digital space, which assists physical entities and virtual objects in creating a closer connection. Therefore, perceptual understanding and object recognition have become an urgent hot topic in the digital twin. Existing surface material classification schemes often achieve recognition through machine learning or deep learning in a single modality, ignoring the complementarity between multiple modalities. In order to overcome this dilemma, we propose a multimodal fusion network in our article that combines two modalities, visual and haptic, for surface material recognition. On the one hand, the network makes full use of the potential correlations between multiple modalities to deeply mine the modal semantics and complete the data mapping. On the other hand, the network is extensible and can be used as a universal architecture to include more modalities. Experiments show that the constructed multimodal fusion network can achieve 99.42% classification accuracy while reducing complexity.
Keywords: digital twin; multimodal fusion; object recognition; deep learning; transfer learning
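The two-branch visual-haptic fusion described above can be sketched as a pair of unimodal encoders whose features are concatenated before a shared classifier. This is a hedged illustration rather than the authors' network; the ResNet-18 visual backbone, the haptic MLP, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualHapticFusionNet(nn.Module):
    """Hedged sketch of a two-branch visual-haptic fusion classifier."""
    def __init__(self, num_classes=10, haptic_dim=64):
        super().__init__()
        backbone = models.resnet18(weights=None)   # visual branch; pretrained weights optional
        backbone.fc = nn.Identity()                # keep the 512-d feature vector
        self.visual = backbone
        self.haptic = nn.Sequential(               # haptic branch: small MLP over sensor features
            nn.Linear(haptic_dim, 128), nn.ReLU(), nn.Linear(128, 128))
        self.classifier = nn.Sequential(           # feature-level fusion by concatenation
            nn.Linear(512 + 128, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, image, haptic):
        z = torch.cat([self.visual(image), self.haptic(haptic)], dim=1)
        return self.classifier(z)

model = VisualHapticFusionNet()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 64))  # -> (4, 10)
```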
A deep multimodal fusion and multitasking trajectory prediction model for typhoon trajectory prediction to reduce flight scheduling cancellation
3
Authors: TANG Jun, QIN Wanting, PAN Qingtao, LAO Songyang. Journal of Systems Engineering and Electronics (SCIE, CSCD), 2024, No. 3, pp. 666-678 (13 pages)
Natural events have had a significant impact on overall flight activity, and the aviation industry plays a vital role in helping society cope with the impact of these events. As the typhoon season, one of the most impactful weather periods, appears and continues, airlines operating in threatened areas and passengers with travel plans during this time period will pay close attention to the development of tropical storms. This paper proposes a deep multimodal fusion and multitasking trajectory prediction model that can improve the reliability of typhoon trajectory prediction and reduce the quantity of flight scheduling cancellations. The deep multimodal fusion module is formed by deep fusion of the features output by multiple submodal fusion modules, and the multitask generation module uses longitude and latitude as two related tasks for simultaneous prediction. With more dependable data accuracy, problems can be analysed rapidly and more efficiently, enabling better decision-making with a proactive versus reactive posture. When multiple modalities coexist, features can be extracted from them simultaneously to supplement each other's information. An actual case study, Typhoon Lekima, which swept China in 2019, has demonstrated that the algorithm can effectively reduce the number of unnecessary flight cancellations compared to existing flight scheduling and assist the new generation of flight scheduling systems under extreme weather.
Keywords: flight scheduling optimization; deep multimodal fusion; multitasking trajectory prediction; typhoon weather; flight cancellation; prediction reliability
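The multitask generation module, which predicts longitude and latitude as two related tasks from a shared fused representation, can be sketched as follows. This is an assumed simplification, not the paper's model; the GRU encoder, feature dimension, and joint MSE loss are illustrative choices.

```python
import torch
import torch.nn as nn

class MultiTaskTrackPredictor(nn.Module):
    """Hedged sketch: shared encoder over a fused feature sequence,
    with separate longitude and latitude regression heads."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.lon_head = nn.Linear(hidden, 1)   # task 1: longitude
        self.lat_head = nn.Linear(hidden, 1)   # task 2: latitude

    def forward(self, fused_seq):
        # fused_seq: (B, T, feat_dim) multimodal features of the past track
        _, h = self.encoder(fused_seq)         # h: (1, B, hidden)
        h = h.squeeze(0)
        return self.lon_head(h), self.lat_head(h)

model = MultiTaskTrackPredictor()
lon, lat = model(torch.randn(8, 12, 32))
loss = nn.functional.mse_loss(lon, torch.zeros(8, 1)) + \
       nn.functional.mse_loss(lat, torch.zeros(8, 1))   # joint multitask loss
```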
Component recognition of ISAR targets via multimodal feature fusion
4
Authors: Chenxuan LI, Weigang ZHU, Wei QU, Fanyin MA, Rundong WANG. Chinese Journal of Aeronautics, 2025, No. 2, pp. 256-273 (18 pages)
Inverse Synthetic Aperture Radar (ISAR) images of complex targets have a low Signal-to-Noise Ratio (SNR) and contain fuzzy edges and large differences in scattering intensity, which limits the recognition performance of ISAR systems. Also, data scarcity poses a greater challenge to the accurate recognition of components. To address the issues of component recognition in complex ISAR targets, this paper adopts semantic segmentation and proposes a few-shot semantic segmentation framework fusing multimodal features. The scarcity of available data is mitigated by using a two-branch scattering feature encoding structure. Then, the high-resolution features are obtained by fusing the ISAR image texture features and scattering quantization information of complex-valued echoes, thereby achieving significantly higher structural adaptability. Meanwhile, the scattering trait enhancement module and the statistical quantification module are designed. The edge texture is enhanced based on the scatter quantization property, which alleviates the segmentation challenge of edge blurring under low SNR conditions. The coupling of query/support samples is enhanced through four-dimensional convolution. Additionally, to overcome fusion challenges caused by information differences, multimodal feature fusion is guided by equilibrium comprehension loss. In this way, the performance potential of the fusion framework is fully unleashed, and the decision risk is effectively reduced. Experiments demonstrate the great advantages of the proposed framework in multimodal feature fusion, and it still exhibits great component segmentation capability under low SNR/edge blurring conditions.
Keywords: few-shot; semantic segmentation; Inverse Synthetic Aperture Radar (ISAR); scattering; multimodal fusion
TGFN-SD: A text-guided multimodal fusion network for swine disease diagnosis
5
Authors: Gan Yang, Qifeng Li, Chunjiang Zhao, Chaoyuan Wang, Hua Yan, Rui Meng, Yu Liu, Ligen Yu. Artificial Intelligence in Agriculture, 2025, No. 2, pp. 266-279 (14 pages)
China is the world's largest producer of pigs, but traditional manual prevention, treatment, and diagnosis methods cannot satisfy the demands of the current intensive production environment. Existing computer-aided diagnosis (CAD) systems for pigs are dominated by expert systems, which cannot be widely applied because the collection and maintenance of knowledge is difficult, and most of them ignore the effect of multimodal information. A swine disease diagnosis model was proposed in this study, the Text-Guided Fusion Network-Swine Diagnosis (TGFN-SD) model, which integrated text case reports and disease images. The model integrated the differences and complementary information in the multimodal representation of diseases through the text-guided transformer module such that text case reports could carry the semantic information of disease images for disease identification. Moreover, it alleviated the phenotypic overlap problem caused by similar diseases in combination with supervised learning and self-supervised learning. Experimental results revealed that TGFN-SD achieved satisfactory performance on a constructed swine disease image and text dataset (SDT6K) that covered six disease classification datasets, with accuracy and F1-score of 94.48% and 94.4%, respectively. The accuracies and F1-scores increased by 8.35% and 7.24% compared with those under the unimodal situation and by 2.02% and 1.63% compared with those of the optimal baseline model under the multimodal fusion. Additionally, interpretability analysis revealed that the model focus area was consistent with the habits and rules of the veterinary clinical diagnosis of pigs, indicating the effectiveness of the proposed model and providing new ideas and perspectives for the study of swine disease CAD.
Keywords: computer-aided diagnosis; electronic health records; multimodal fusion; self-supervised learning; swine disease
Low-Rank Adapter Layers and Bidirectional Gated Feature Fusion for Multimodal Hateful Memes Classification
6
Authors: Youwei Huang, Han Zhong, Cheng Cheng, Yijie Peng. Computers, Materials & Continua, 2025, No. 7, pp. 1863-1882 (20 pages)
A hateful meme is a multimodal medium that combines images and texts. The potential hate content of hateful memes has caused serious problems for social media security. The current hateful memes classification task faces significant data scarcity challenges, and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting issues. In addition, it is a challenge to understand the underlying relationship between text and images in hateful memes. To address these issues, we propose a multimodal hateful memes classification model named LABF, which is based on low-rank adapter layers and bidirectional gated feature fusion. Firstly, low-rank adapter layers are adopted to learn the feature representation of the new dataset. This is achieved by introducing a small number of additional parameters while retaining prior knowledge of the CLIP model, which effectively alleviates the overfitting phenomenon. Secondly, a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion. Experimental results show that the method significantly outperforms existing methods on two public datasets, verifying its effectiveness and robustness.
Keywords: hateful meme; multimodal fusion; multimodal data; deep learning
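The two core ideas in this abstract, low-rank adapter layers over a frozen backbone and bidirectional gated feature fusion, can each be sketched in a few lines of PyTorch. The sketch below is a hedged approximation of the general techniques, not the LABF implementation; class names, ranks, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hedged sketch of a low-rank adapter: frozen base weight plus trainable A, B."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep the pre-trained prior frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

class BidirectionalGatedFusion(nn.Module):
    """Hedged sketch: gates control how much each modality absorbs from the other."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate_t = nn.Linear(2 * dim, dim)   # text <- image gate
        self.gate_v = nn.Linear(2 * dim, dim)   # image <- text gate

    def forward(self, text, image):
        cat = torch.cat([text, image], dim=-1)
        t = text + torch.sigmoid(self.gate_t(cat)) * image
        v = image + torch.sigmoid(self.gate_v(cat)) * text
        return torch.cat([t, v], dim=-1)        # fused representation for classification

adapted = LoRALinear(nn.Linear(512, 512))        # wrap an assumed frozen projection layer
fused = BidirectionalGatedFusion()(torch.randn(4, 512), torch.randn(4, 512))
```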
Transformer-based audio-visual multimodal fusion for fine-grained recognition of individual sow nursing behaviour
7
Authors: Yuqing Yang, Chengguo Xu, Wenhao Hou, Alan G. McElligott, Kai Liu, Yueju Xue. Artificial Intelligence in Agriculture, 2025, No. 3, pp. 363-376 (14 pages)
Nursing behaviour and the calling-to-nurse sound are crucial indicators for assessing sow maternal behaviour and nursing status. However, accurately identifying these behaviours for individual sows in complex indoor pig housing is challenging due to factors such as variable lighting, rail obstructions, and interference from other sows' calls. Multimodal fusion, which integrates audio and visual data, has proven to be an effective approach for improving accuracy and robustness in complex scenarios. In this study, we designed an audio-visual data acquisition system that includes a camera for synchronised audio and video capture, along with a custom-developed sound source localisation system that leverages a sound sensor to track sound direction. Specifically, we proposed a novel transformer-based audio-visual multimodal fusion (TMF) framework for recognising fine-grained sow nursing behaviour with or without the calling-to-nurse sound. Initially, a unimodal self-attention enhancement (USE) module was employed to augment video and audio features with global contextual information. Subsequently, we developed an audio-visual interaction enhancement (AVIE) module to compress relevant information and reduce noise using the information bottleneck principle. Moreover, we presented an adaptive dynamic decision fusion strategy to optimise the model's performance by focusing on the most relevant features in each modality. Finally, we comprehensively identified fine-grained nursing behaviours by integrating audio and fused information, while incorporating angle information from the real-time sound source localisation system to accurately determine whether the sound cues originate from the target sow. Our results demonstrate that the proposed method achieves an accuracy of 98.42% for general sow nursing behaviour and 94.37% for fine-grained nursing behaviour, including nursing with and without the calling-to-nurse sound, and non-nursing behaviours. This fine-grained nursing information can provide a more nuanced understanding of the sow's health and lactation willingness, thereby enhancing management practices in pig farming.
Keywords: bottleneck-based transformer; calling-to-nurse sound; multimodal fusion; sound source localisation; sow
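The adaptive dynamic decision fusion strategy can be illustrated as a small network that produces per-sample weights over the audio and visual class logits before summing them. This is a generic decision-level fusion sketch under assumed dimensions, not the TMF framework itself.

```python
import torch
import torch.nn as nn

class AdaptiveDecisionFusion(nn.Module):
    """Hedged sketch: per-sample learned weights over audio/visual class logits."""
    def __init__(self, feat_dim=256, num_classes=3):
        super().__init__()
        self.audio_head = nn.Linear(feat_dim, num_classes)
        self.video_head = nn.Linear(feat_dim, num_classes)
        self.weight_net = nn.Linear(2 * feat_dim, 2)   # one weight per modality

    def forward(self, audio_feat, video_feat):
        w = torch.softmax(self.weight_net(torch.cat([audio_feat, video_feat], dim=-1)), dim=-1)
        logits_a = self.audio_head(audio_feat)
        logits_v = self.video_head(video_feat)
        # weighted sum of modality decisions; the weights adapt to each sample
        return w[:, 0:1] * logits_a + w[:, 1:2] * logits_v

fusion = AdaptiveDecisionFusion()
out = fusion(torch.randn(4, 256), torch.randn(4, 256))   # -> (4, 3) class logits
```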
Lightweight Classroom Student Action Recognition Method Based on Spatiotemporal Multimodal Feature Fusion
8
Authors: Shaodong Zou, Di Wu, Jianhou Gan, Juxiang Zhou, Jiatian Mei. Computers, Materials & Continua, 2025, No. 4, pp. 1101-1116 (16 pages)
The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos, providing a foundation for realizing intelligent and accurate teaching. However, the complex nature of the classroom environment has added challenges and difficulties to the process of student action recognition. In this research article, with regard to the circumstances where students are prone to be occluded and classroom computing resources are restricted in real classroom scenarios, a lightweight multi-modal fusion action recognition approach is put forward. This proposed method is capable of enhancing the accuracy of student action recognition while concurrently diminishing the number of parameters of the model and the computation amount, thereby achieving a more efficient and accurate recognition performance. In the feature extraction stage, this method fuses the keypoint heatmap with the RGB (Red-Green-Blue color model) image. In order to fully utilize the unique information of different modalities for feature complementarity, a Feature Fusion Module (FFE) is introduced. The FFE encodes and fuses the unique features of the two modalities during the feature extraction process. This fusion strategy not only achieves fusion and complementarity between modalities, but also improves the overall model performance. Furthermore, to reduce the computational load and parameter scale of the model, we use keypoint information to crop RGB images. At the same time, the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features. These methods significantly reduce the computational load and parameter scale. The number of parameters of the model is 1.40 million, and the computation amount is 5.04 billion floating-point operations (GFLOPs), achieving an efficient lightweight design. On the Student Classroom Action Dataset (SCAD), the accuracy of the model is 88.36%. On NTU 60 (the Nanyang Technological University RGB+D dataset with 60 categories), the accuracies on X-Sub (the people in the training set are different from those in the test set) and X-View (the perspectives of the training set and the test set are different) are 95.76% and 98.82%, respectively. On the NTU 120 dataset (the Nanyang Technological University RGB+D dataset with 120 categories), the accuracies on X-Sub and X-Set (the perspectives of the training set and the test set are different) are 91.97% and 93.45%, respectively. The model has achieved a balance in terms of accuracy, computation amount, and the number of parameters.
Keywords: action recognition; student classroom action; multimodal fusion; lightweight model design
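Two of the lightweight design choices mentioned above, cropping the RGB frame to the keypoint bounding box and fusing the keypoint heatmap with the RGB image by channel concatenation, can be sketched as follows. The shapes, margin parameter, and number of keypoints are assumptions, not values from the paper.

```python
import torch

def crop_by_keypoints(frame, keypoints, margin=0.1):
    """Hedged sketch: crop an RGB frame (C, H, W) to the keypoint bounding box."""
    h, w = frame.shape[1:]
    xs, ys = keypoints[:, 0], keypoints[:, 1]              # keypoints: (K, 2) in pixels
    x0 = int(max(xs.min().item() - margin * w, 0))
    x1 = int(min(xs.max().item() + margin * w, w))
    y0 = int(max(ys.min().item() - margin * h, 0))
    y1 = int(min(ys.max().item() + margin * h, h))
    return frame[:, y0:y1, x0:x1]

def fuse_rgb_heatmap(rgb, heatmap):
    """Early fusion by channel concatenation: (3, H, W) + (K, H, W) -> (3+K, H, W)."""
    return torch.cat([rgb, heatmap], dim=0)

frame = torch.rand(3, 224, 224)
kpts = torch.tensor([[60.0, 50.0], [160.0, 200.0]])        # two illustrative keypoints
cropped = crop_by_keypoints(frame, kpts)
fused = fuse_rgb_heatmap(frame, torch.rand(17, 224, 224))  # 17 assumed keypoint channels
```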
A Dual Stream Multimodal Alignment and Fusion Network for Classifying Short Videos
9
Authors: ZHOU Ming, WANG Tong. Journal of Donghua University (English Edition), 2025, No. 1, pp. 88-95 (8 pages)
Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of the video, and the modality fusion approach tends to be too simple, often neglecting modality alignment before fusion. This research introduces a novel dual stream multimodal alignment and fusion network named DMAFNet for classifying short videos. The network uses two unimodal encoder modules to extract features within modalities and exploits a multimodal encoder module to learn the interaction between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video text matching (VTM) auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of loss functions. Diverse experiments prove the efficiency of DMAFNet in multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
Keywords: video classification; multimodal fusion; feature alignment
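The contrastive alignment between the two unimodal encoders can be illustrated with a standard symmetric InfoNCE loss over paired video and text embeddings. This is a generic sketch of the technique; the temperature and embedding size are assumptions, and DMAFNet's actual loss may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Hedged sketch of a symmetric InfoNCE loss aligning paired video/text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```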
Performance vs. Complexity Comparative Analysis of Multimodal Bilinear Pooling Fusion Approaches for Deep Learning-Based Visual Arabic-Question Answering Systems
10
Authors: Sarah M. Kamel, Mai A. Fadel, Lamiaa Elrefaei, Shimaa I. Hassan. Computer Modeling in Engineering & Sciences, 2025, No. 4, pp. 373-411 (39 pages)
Visual question answering (VQA) is a multimodal task, involving a deep understanding of the image scene and the question's meaning and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks to semantically represent the given image and question in a fine-grained manner, namely ResNet-152 and Gated Recurrent Units (GRU); (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques, in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model's performance, until reaching the best performance of 89.25%. Further, experiments have proven that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique has shown the best balance between the model complexity and its performance, for VQA systems designed to answer yes/no questions.
Keywords: Arabic-VQA; deep learning-based VQA; deep multimodal information fusion; multimodal representation learning; VQA of yes/no questions; VQA model complexity; VQA model performance; performance-complexity trade-off
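One widely used member of the compared family, low-rank (Hadamard-product) bilinear pooling, can be sketched as below: both modalities are projected to a common rank, multiplied element-wise, and projected to the answer space. This illustrates the general technique only; the paper's MLPB variant may differ, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Hedged sketch of low-rank multimodal bilinear pooling:
    z = P(tanh(U x) * tanh(V q))."""
    def __init__(self, img_dim=2048, q_dim=1024, rank=512, out_dim=2):
        super().__init__()
        self.U = nn.Linear(img_dim, rank)
        self.V = nn.Linear(q_dim, rank)
        self.P = nn.Linear(rank, out_dim)

    def forward(self, img_feat, q_feat):
        joint = torch.tanh(self.U(img_feat)) * torch.tanh(self.V(q_feat))  # Hadamard product
        return self.P(joint)

pool = LowRankBilinearPooling(out_dim=2)        # e.g., yes/no answer logits
logits = pool(torch.randn(4, 2048), torch.randn(4, 1024))
```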
Multimodal medical image fusion based on mask optimization and parallel attention mechanism
11
Authors: DI Jing, LIANG Chan, GUO Wenqing, LIAN Jing. Journal of Measurement Science and Instrumentation, 2025, No. 1, pp. 26-36 (11 pages)
Medical image fusion technology is crucial for improving the detection accuracy and treatment efficiency of diseases, but existing fusion methods have problems such as blurred texture details, low contrast, and inability to fully extract fused image information. Therefore, a multimodal medical image fusion method based on mask optimization and a parallel attention mechanism was proposed to address the aforementioned issues. Firstly, it converted the entire image into a binary mask, and constructed a contour feature map to maximize the contour feature information of the image and a triple path network for image texture detail feature extraction and optimization. Secondly, a contrast enhancement module and a detail preservation module were proposed to enhance the overall brightness and texture details of the image. Afterwards, a parallel attention mechanism was constructed using channel features and spatial feature changes to fuse images and enhance the salient information of the fused images. Finally, a decoupling network composed of residual networks was set up to optimize the information between the fused image and the source image so as to reduce information loss in the fused image. Compared with nine high-level methods proposed in recent years, the seven objective evaluation indicators of our method have improved by 6% to 31%, indicating that this method can obtain fusion results with clearer texture details, higher contrast, and smaller pixel differences between the fused image and the source image. It is superior to other comparison algorithms in both subjective and objective indicators.
Keywords: multimodal medical image fusion; binary mask; contrast enhancement module; parallel attention mechanism; decoupling network
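The parallel attention mechanism, combining channel-wise and spatial attention computed in parallel over the fused feature map, can be sketched as follows. The SE-style channel branch, the 7x7 spatial branch, and the additive combination are assumptions used for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Hedged sketch: channel attention and spatial attention applied in parallel."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(                       # channel branch (SE-style)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(), nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(                       # spatial branch
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        # x: (B, C, H, W) fused medical image features
        return x * self.channel(x) + x * self.spatial(x)    # parallel branches, summed

attn = ParallelAttention()
out = attn(torch.rand(2, 64, 128, 128))
```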
A Large-Scale Spatio-Temporal Multimodal Fusion Framework for Traffic Prediction (Cited: 3)
12
Authors: Bodong Zhou, Jiahui Liu, Songyi Cui, Yaping Zhao. Big Data Mining and Analytics (EI, CSCD), 2024, No. 3, pp. 621-636 (16 pages)
Traffic prediction is crucial for urban planning and transportation management, and deep learning techniques have emerged as effective tools for this task. While previous works have made advancements, they often overlook comprehensive analyses of spatio-temporal distributions and the integration of multimodal representations. Our research addresses these limitations by proposing a large-scale spatio-temporal multimodal fusion framework that enables accurate predictions based on location queries and seamlessly integrates various data sources. Specifically, we utilize Convolutional Neural Networks (CNNs) for spatial information processing and a combination of Recurrent Neural Networks (RNNs) for final spatio-temporal traffic prediction. This framework not only effectively reveals its ability to integrate various modal data in the spatio-temporal hyperspace, but has also been successfully implemented in a real-world large-scale map, showcasing its practical importance in tackling urban traffic challenges. The findings presented in this work contribute to the advancement of traffic prediction methods, offering valuable insights for further research and application in addressing real-world transportation challenges.
Keywords: spatio-temporal; traffic prediction; multimodal fusion; learning representation
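The described pipeline, CNNs for spatial processing followed by RNNs for the final spatio-temporal prediction, can be sketched as a per-timestep convolutional encoder feeding a GRU. The grid size, channel counts, and prediction horizon below are assumptions, not the framework's actual configuration.

```python
import torch
import torch.nn as nn

class CnnGruTrafficPredictor(nn.Module):
    """Hedged sketch: per-timestep CNN over a traffic grid, GRU over time."""
    def __init__(self, in_ch=2, hidden=128, horizon=1):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())            # -> 32 * 4 * 4 = 512 features
        self.gru = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)                 # predicted traffic value(s)

    def forward(self, x):
        # x: (B, T, C, H, W) sequence of traffic grids around the queried location
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        _, h = self.gru(feats)
        return self.head(h.squeeze(0))

model = CnnGruTrafficPredictor()
pred = model(torch.rand(4, 12, 2, 32, 32))     # -> (4, 1)
```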
Multimodal Fusion of Brain Imaging Data: Methods and Applications (Cited: 1)
13
Authors: Na Luo, Weiyang Shi, Zhengyi Yang, Ming Song, Tianzi Jiang. Machine Intelligence Research (EI, CSCD), 2024, No. 1, pp. 136-152 (17 pages)
Neuroimaging data typically include multiple modalities, such as structural or functional magnetic resonance imaging, diffusion tensor imaging, and positron emission tomography, which provide multiple views for observing and analyzing the brain. To leverage the complementary representations of different modalities, multimodal fusion is consequently needed to dig out both inter-modality and intra-modality information. With the exploited rich information, it is becoming popular to combine multiple modality data to explore the structural and functional characteristics of the brain in both health and disease status. In this paper, we first review a wide spectrum of advanced machine learning methodologies for fusing multimodal brain imaging data, broadly categorized into unsupervised and supervised learning strategies. Following this, some representative applications are discussed, including how they help to understand brain arealization, how they improve the prediction of behavioral phenotypes and brain aging, and how they accelerate the biomarker exploration of brain diseases. Finally, we discuss some exciting emerging trends and important future directions. Collectively, we intend to offer a comprehensive overview of brain imaging fusion methods and their successful applications, along with the challenges imposed by multi-scale and big data, which raises an urgent demand for developing new models and platforms.
Keywords: multimodal fusion; supervised learning; unsupervised learning; brain atlas; cognition; brain disorders
Image Style Transfer for Exhibition Hall Design Based on Multimodal Semantic-Enhanced Algorithm
14
Authors: Qing Xie, Ruiyun Yu. Computers, Materials & Continua, 2025, No. 7, pp. 1123-1144 (22 pages)
Although existing style transfer techniques have made significant progress in the field of image generation, there are still some challenges in the field of exhibition hall design. The existing style transfer methods mainly focus on the transformation of single-dimensional features, but ignore the deep integration of content and style features in exhibition hall design. In addition, existing methods are deficient in detail retention, especially in accurately capturing and reproducing local textures and details while preserving the content image structure. Furthermore, point-based attention mechanisms tend to ignore the complexity and diversity of image features in multi-dimensional space, resulting in alignment problems between features in different semantic areas and inconsistent stylistic features in content areas. In this context, this paper proposes a semantic-enhanced multimodal style transfer algorithm tailored for exhibition hall design. The proposed approach leverages a multimodal encoder architecture to integrate information from text, source images, and style images, using separate encoder modules for each modality to capture shallow, deep, and semantic features. A novel Style Transfer Convolution (STConv) kernel, based on the Visual Geometry Group (VGG) 19 network, is introduced to improve feature extraction in style transfer. Additionally, an enhanced Transformer encoder is incorporated to capture contextual semantic information within images, while the CLIP model is employed for text data processing. A hybrid attention module is designed to precisely capture style features, achieving multimodal feature fusion via a diffusion model that generates exhibition hall design images aligned with stylistic requirements. Quantitative experiments show that, compared with the most advanced algorithms, the proposed method achieves significant performance improvements on both the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) indexes. For example, on the ExpoArchive dataset, the proposed method has an FID value of 87.9 and a KID value of 1.98, which is significantly superior to other methods.
Keywords: exhibition hall design; style transfer; multimodal fusion; semantic enhancement; diffusion model
A Disentangled Representation-Based Multimodal Fusion Framework Integrating Pathomics and Radiomics for KRAS Mutation Detection in Colorectal Cancer (Cited: 1)
15
Authors: Zhilong Lv, Rui Yan, Yuexiao Lin, Lin Gao, Fa Zhang, Ying Wang. Big Data Mining and Analytics (EI, CSCD), 2024, No. 3, pp. 590-602 (13 pages)
Kirsten rat sarcoma viral oncogene homolog (namely KRAS) is a key biomarker for prognostic analysis and targeted therapy of colorectal cancer. Recently, the advancement of machine learning, especially deep learning, has greatly promoted the development of KRAS mutation detection from tumor phenotype data, such as pathology slides or radiology images. However, there are still two major problems in existing studies: inadequate single-modal feature learning and lack of multimodal phenotypic feature fusion. In this paper, we propose a Disentangled Representation-based Multimodal Fusion framework integrating Pathomics and Radiomics (DRMF-PaRa) for KRAS mutation detection. Specifically, the DRMF-PaRa model consists of three parts: (1) the pathomics learning module, which introduces a tissue-guided Transformer model to extract more comprehensive and targeted pathological features; (2) the radiomics learning module, which captures the generic hand-crafted radiomics features and the task-specific deep radiomics features; (3) the disentangled representation-based multimodal fusion module, which learns factorized subspaces for each modality and provides a holistic view of the two heterogeneous phenotypic features. The proposed model is developed and evaluated on a multi-modality dataset of 111 colorectal cancer patients with whole slide images and contrast-enhanced CT. The experimental results demonstrate the superiority of the proposed DRMF-PaRa model, with an accuracy of 0.876 and an AUC of 0.865 for KRAS mutation detection.
Keywords: KRAS mutation detection; multimodal feature fusion; pathomics; radiomics
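The disentangled fusion idea, learning a shared subspace and a modality-specific subspace for each of pathomics and radiomics, can be sketched with linear projections plus simple agreement and orthogonality penalties. This is a hedged simplification of the general technique, not the DRMF-PaRa model; the dimensions and penalty forms are assumptions.

```python
import torch
import torch.nn as nn

class DisentangledFusion(nn.Module):
    """Hedged sketch: shared + modality-specific subspaces for pathomics/radiomics."""
    def __init__(self, path_dim=512, rad_dim=256, latent=64, num_classes=2):
        super().__init__()
        self.shared_p = nn.Linear(path_dim, latent)
        self.shared_r = nn.Linear(rad_dim, latent)
        self.spec_p = nn.Linear(path_dim, latent)
        self.spec_r = nn.Linear(rad_dim, latent)
        self.classifier = nn.Linear(4 * latent, num_classes)

    def forward(self, pathomics, radiomics):
        sp, sr = self.shared_p(pathomics), self.shared_r(radiomics)
        pp, rr = self.spec_p(pathomics), self.spec_r(radiomics)
        logits = self.classifier(torch.cat([sp, sr, pp, rr], dim=-1))
        # disentanglement penalties: shared parts agree, specific parts stay apart from shared
        align = ((sp - sr) ** 2).mean()
        ortho = (sp * pp).mean().abs() + (sr * rr).mean().abs()
        return logits, align + ortho

model = DisentangledFusion()
logits, reg = model(torch.randn(8, 512), torch.randn(8, 256))
```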
A Comprehensive Review of Multimodal Deep Learning for Enhanced Medical Diagnostics (Cited: 1)
16
Authors: Aya M. Al-Zoghby, Ahmed Ismail Ebada, Aya S. Saleh, Mohammed Abdelhay, Wael A. Awad. Computers, Materials & Continua, 2025, No. 9, pp. 4155-4193 (39 pages)
Multimodal deep learning has emerged as a key paradigm in contemporary medical diagnostics, advancing precision medicine by enabling integration and learning from diverse data sources. The exponential growth of high-dimensional healthcare data, encompassing genomic, transcriptomic, and other omics profiles, as well as radiological imaging and histopathological slides, makes this approach increasingly important because, when examined separately, these data sources only offer a fragmented picture of intricate disease processes. Multimodal deep learning leverages the complementary properties of multiple data modalities to enable more accurate prognostic modeling, more robust disease characterization, and improved treatment decision-making. This review provides a comprehensive overview of the current state of multimodal deep learning approaches in medical diagnosis. We classify and examine important application domains, such as (1) radiology, where automated report generation and lesion detection are facilitated by image-text integration; (2) histopathology, where fusion models improve tumor classification and grading; and (3) multi-omics, where molecular subtypes and latent biomarkers are revealed through cross-modal learning. We provide an overview of representative research, methodological advancements, and clinical consequences for each domain. Additionally, we critically analyzed the fundamental issues preventing wider adoption, including computational complexity (particularly in training scalable, multi-branch networks), data heterogeneity (resulting from modality-specific noise, resolution variations, and inconsistent annotations), and the challenge of maintaining significant cross-modal correlations during fusion. These problems impede interpretability, which is crucial for clinical trust and use, in addition to performance and generalizability. Lastly, we outline important areas for future research, including the development of standardized protocols for harmonizing data, the creation of lightweight and interpretable fusion architectures, the integration of real-time clinical decision support systems, and the promotion of cooperation for federated multimodal learning. Our goal is to provide researchers and clinicians with a concise overview of the field's present state, enduring constraints, and exciting directions for further research through this review.
Keywords: multimodal deep learning; medical diagnostics; multimodal healthcare fusion; healthcare data integration
Research on Multimodal AIGC Video Detection for Identifying Fake Videos Generated by Large Models
17
Authors: Yong Liu, Tianning Sun, Daofu Gong, Li Di, Xu Zhao. Computers, Materials & Continua, 2025, No. 10, pp. 1161-1184 (24 pages)
Traditional approaches to detecting high-quality forged videos typically have low recognition accuracy and tend to misclassify them. This paper addresses the challenge of detecting high-quality deepfake videos by improving the accuracy of Artificial Intelligence Generated Content (AIGC) video authenticity detection with a multimodal information fusion approach. First, a high-quality multimodal video dataset is collected and normalized, including resolution correction and frame rate unification. Next, feature extraction techniques are employed to extract features from the visual, audio, and text modalities. Subsequently, these features are fused into a multimodal feature matrix based on multilayer perceptrons and attention mechanisms. Finally, the matrix is fed into a multimodal information fusion layer in order to construct and train a deep learning model. Experimental findings show that the multimodal fusion model achieves an accuracy of 93.8% for the detection of video authenticity, showing significant improvement over unimodal models and confirming the model's better performance and robustness in AIGC video authenticity detection.
Keywords: multimodal information fusion; artificial intelligence generated content; authenticity detection; feature extraction; multi-layer perceptron; attention mechanism
3D Vehicle Detection Algorithm Based on Multimodal Decision-Level Fusion
18
Authors: Peicheng Shi, Heng Qi, Zhiqiang Liu, Aixi Yang. Computer Modeling in Engineering & Sciences (SCIE, EI), 2023, No. 6, pp. 2007-2023 (17 pages)
3D vehicle detection based on LiDAR-camera fusion is becoming an emerging research topic in autonomous driving. The algorithm based on the Camera-LiDAR object candidate fusion method (CLOCs) is currently considered to be a more effective decision-level fusion algorithm, but it does not fully utilize the extracted features of 3D and 2D. Therefore, we proposed a 3D vehicle detection algorithm based on multimodal decision-level fusion. First, we project the anchor point of the 3D detection bounding box into the 2D image, calculate the distance between the 2D and 3D anchor points, and use this distance as a new fusion feature to enhance the feature redundancy of the network. Subsequently, we add an attention module, squeeze-and-excitation networks, to weight each feature channel, enhance the important features of the network, and suppress useless features. The experimental results show that the mean average precision of the algorithm on the KITTI dataset is 82.96%, which outperforms previous state-of-the-art multimodal fusion-based methods, and the average accuracy in the Easy, Moderate, and Hard evaluation indicators reaches 88.96%, 82.60%, and 77.31%, respectively, which are higher than the original CLOCs model by 1.02%, 2.29%, and 0.41%, respectively. Compared with the original CLOCs algorithm, our algorithm has higher accuracy and better performance in 3D vehicle detection.
Keywords: 3D vehicle detection; multimodal fusion; CLOCs; network structure optimization; attention module
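The attention module referenced here, squeeze-and-excitation, is a standard channel-reweighting block and can be sketched directly. The channel count and reduction ratio below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling -> bottleneck MLP -> per-channel gates."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # x: (B, C, H, W) fused detection feature map
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze: (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # excite: reweight channels

se = SEBlock()
out = se(torch.rand(2, 64, 32, 32))
```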
MFF-Net: Multimodal Feature Fusion Network for 3D Object Detection
19
Authors: Peicheng Shi, Zhiqiang Liu, Heng Qi, Aixi Yang. Computers, Materials & Continua (SCIE, EI), 2023, No. 6, pp. 5615-5637 (23 pages)
In complex traffic environment scenarios, it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance. The accuracy of 3D object detection will be affected by problems such as illumination changes, object occlusion, and object detection distance. To this end, we address these challenges by proposing a multimodal feature fusion network for 3D object detection (MFF-Net). In this research, this paper first uses the spatial transformation projection algorithm to map the image features into the feature space, so that the image features are in the same spatial dimension when fused with the point cloud features. Then, feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features, suppress useless features, and increase the directionality of the network to features. Finally, this paper adjusts the one-dimensional threshold in the non-maximum suppression algorithm to reduce the probability of false detections and missed detections. So far, this paper has constructed a complete 3D target detection network based on multimodal feature fusion. The experimental results show that the proposed method achieves an average accuracy of 82.60% on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, outperforming previous state-of-the-art multimodal fusion networks. In the Easy, Moderate, and Hard evaluation indicators, the accuracy rate of this paper reaches 90.96%, 81.46%, and 75.39%, respectively. This shows that the MFF-Net network has good performance in 3D object detection.
Keywords: 3D object detection; multimodal fusion; neural network; autonomous driving; attention mechanism
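The spatial transformation projection that places image and point-cloud features in a common space relies on projecting 3D points into the image plane with the camera calibration. The sketch below shows the standard pinhole projection only; the intrinsic matrix values are illustrative (KITTI-like), not taken from the paper.

```python
import numpy as np

def project_lidar_to_image(points_cam, K):
    """Hedged sketch: project 3D points (already in the camera frame, N x 3)
    onto the image plane with intrinsic matrix K (3 x 3)."""
    pts = points_cam[points_cam[:, 2] > 0]           # keep points in front of the camera
    uvw = pts @ K.T                                   # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]                     # normalize by depth
    return uv, pts[:, 2]                              # pixel coordinates and depths

K = np.array([[721.5, 0.0, 609.6],                    # illustrative KITTI-like intrinsics
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
points = np.random.rand(100, 3) * np.array([20.0, 5.0, 40.0])
uv, depth = project_lidar_to_image(points, K)
```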