Background: Advances in medical imaging are constrained by fundamental trade-offs between acquisition speed, radiation dose, and image quality, forcing clinicians to work with noisy, incomplete data. Existing reconstruction methods either compromise on accuracy with iterative algorithms or suffer from limited generalizability with task-specific deep learning approaches. Methods: We present LDM-PIR, a lightweight physics-conditioned diffusion multi-model for medical image reconstruction that addresses key challenges in magnetic resonance imaging (MRI), CT, and low-photon imaging. Unlike traditional iterative methods, which are computationally expensive, or task-specific deep learning approaches, which lack generalizability, LDM-PIR integrates three innovations: (1) a physics-conditioned diffusion framework that embeds acquisition operators (Fourier/Radon transforms) and noise models directly into the reconstruction process; (2) a multi-model architecture that unifies denoising, inpainting, and super-resolution via shared weight conditioning; and (3) a lightweight design (2.1M parameters) enabling rapid inference (0.8 s/image on GPU). Through self-supervised fine-tuning with measurement consistency losses, LDM-PIR adapts to new imaging modalities using fewer annotated samples. Results: LDM-PIR achieves state-of-the-art performance on fastMRI (peak signal-to-noise ratio (PSNR): 34.04 for single-coil/31.50 for multi-coil) and on the Lung Image Database Consortium and Image Database Resource Initiative dataset (28.83 PSNR under Poisson noise). Clinical evaluations demonstrate superior preservation of anatomical structures, with SSIM improvements of 8.8% for single-coil and 4.36% for multi-coil MRI over uDPIR. Conclusion: LDM-PIR offers a flexible, efficient, and scalable solution for medical image reconstruction, addressing the challenges of noise, undersampling, and modality generalization. The model's lightweight design allows for rapid inference, while its self-supervised fine-tuning capability minimizes reliance on large annotated datasets, making it suitable for real-world clinical applications.
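The measurement consistency idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single-coil MRI setting where the acquisition operator is an undersampled 2D Fourier transform; the function names, the random sampling mask, and the image size are hypothetical and are not taken from LDM-PIR itself. A PSNR helper is included because the abstract reports results in that metric.

```python
import numpy as np

def forward_operator(image, mask):
    """Undersampled Fourier acquisition: A(x) = M * F(x) (assumed MRI model)."""
    return mask * np.fft.fft2(image, norm="ortho")

def measurement_consistency_loss(image, measurements, mask):
    """Mean squared residual between A(image) and the acquired k-space samples."""
    residual = forward_operator(image, mask) - measurements
    return float(np.mean(np.abs(residual) ** 2))

def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB, using the reference image's peak value."""
    mse = np.mean((reference - estimate) ** 2)
    return float(10.0 * np.log10(reference.max() ** 2 / mse))

# Hypothetical toy data: a random "image" and a random 30% sampling mask.
rng = np.random.default_rng(0)
truth = rng.random((32, 32))
mask = (rng.random((32, 32)) < 0.3).astype(float)
measurements = forward_operator(truth, mask)

# The true image reproduces its own measurements; an all-zero image does not.
loss_truth = measurement_consistency_loss(truth, measurements, mask)
loss_zero = measurement_consistency_loss(np.zeros_like(truth), measurements, mask)
```

Because the loss needs only the measurements and the operator, not a ground-truth image, it can serve as a self-supervised training signal. That is the general reason such losses reduce the need for annotated samples.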
Early detection of Parkinson's disease (PD) remains a major clinical challenge, especially during the prodromal stage, when symptoms are absent or indistinct. To address this problem, we propose a new deep learning-based approach for detecting PD during the prodromal stage, before overt symptoms develop. We used five publicly accessible datasets (UCI Parkinson's Voice, Spiral Drawings, PaHaW, NewHandPD, and PPMI) and implemented a dual-stream CNN–BiLSTM architecture with Fisher-weighted feature merging and SHAP-based explanation. The model achieved 98.2% accuracy, an F1-score of 0.981, and an AUC of 0.991 on the UCI Voice dataset. Its performance on the remaining datasets was comparable, with accuracy improvements of up to 2–7 percent over strong existing models such as CNN–RNN–MLP, ILN–GNet, and CASENet. Overall, the findings support the diagnostic promise of micro-tremor assessment and demonstrate that combining temporal and spatial features with a scatter-based segment in a multi-modal approach can provide an effective and scalable platform for an early, interpretable PD screening system.
In fire rescue scenarios, traditional manual operations are highly dangerous, as dense smoke, low visibility, extreme heat, and toxic gases not only hinder rescue efficiency but also endanger firefighters' safety. Although intelligent rescue robots can enter hazardous environments in place of humans, smoke poses major challenges for human detection algorithms. These challenges include the attenuation of visible and infrared signals, complex thermal fields, and interference from background objects, all of which make it difficult to accurately identify trapped individuals. To address this problem, we propose VIF-YOLO, a visible–infrared fusion model for real-time human detection in dense smoke environments. The framework introduces a lightweight multimodal fusion (LMF) module based on learnable low-rank representation blocks to integrate visible and infrared images end to end, preserving fine details while enhancing salient features. In addition, an efficient multiscale attention (EMA) mechanism is incorporated into the YOLOv10n backbone to improve feature representation under low-light conditions. Extensive experiments on our newly constructed multimodal smoke human detection (MSHD) dataset demonstrate that VIF-YOLO achieves an mAP50 of 99.5%, precision of 99.2%, and recall of 99.3%, outperforming YOLOv10n by a clear margin. Furthermore, when deployed on the NVIDIA Jetson Xavier NX, VIF-YOLO attains 40.6 FPS with an average inference latency of 24.6 ms, validating its real-time capability on edge-computing platforms. These results confirm that VIF-YOLO provides accurate, robust, and fast detection across complex backgrounds and diverse smoke conditions, ensuring reliable and rapid localization of individuals in need of rescue.
Thunderstorm wind gusts are small in scale, typically occurring within a range of a few kilometers. It is extremely challenging to monitor and forecast thunderstorm wind gusts using only automatic weather stations. Therefore, it is necessary to establish thunderstorm wind gust identification techniques based on multisource high-resolution observations. This paper introduces a new algorithm, called the thunderstorm wind gust identification network (TGNet). It uses multimodal feature fusion to combine the temporal and spatial features of thunderstorm wind gust events. The shapelet transform is first used to extract the temporal features of wind speeds from automatic weather stations, with the aim of distinguishing thunderstorm wind gusts from those caused by synoptic-scale systems or typhoons. Then, an encoder built on the U-shaped network (U-Net) and incorporating recurrent residual convolutional blocks (R2U-Net) is employed to extract the corresponding spatial convective characteristics from satellite, radar, and lightning observations. Finally, a multimodal deep fusion module based on multi-head cross-attention incorporates the temporal features of wind speed at each automatic weather station into the spatial features to obtain a classification of thunderstorm wind gusts every 10 minutes. TGNet products have high accuracy, with a critical success index reaching 0.77. Compared with those of U-Net and R2U-Net, the false alarm rate of TGNet products decreases by 31.28% and 24.15%, respectively. The new algorithm provides grid products of thunderstorm wind gusts with a spatial resolution of 0.01°, updated every 10 minutes. The results are finer and more accurate, thereby helping to improve the accuracy of operational warnings for thunderstorm wind gusts.
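For readers unfamiliar with the verification scores quoted in this abstract, the following sketch shows the standard contingency-table definitions. The counts are hypothetical, chosen only so that the critical success index comes out at 0.77. The code also assumes that "false alarm rate" means the false alarm ratio (false alarms divided by all predicted events), because the abstract does not state which definition the paper uses.

```python
def contingency_scores(hits, misses, false_alarms):
    """Return (CSI, false alarm ratio, probability of detection)."""
    csi = hits / (hits + misses + false_alarms)   # critical success index
    far = false_alarms / (hits + false_alarms)    # false alarm ratio (assumed definition)
    pod = hits / (hits + misses)                  # probability of detection
    return csi, far, pod

# Hypothetical counts for illustration only; not data from the TGNet paper.
csi, far, pod = contingency_scores(hits=77, misses=13, false_alarms=10)
```

The critical success index ignores correct negatives. This makes it suitable for rare events such as thunderstorm gusts, where most grid cells contain no event.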
Artificial intelligence (AI) serves as a key technology in global industrial transformation and technological restructuring and as the core driver of the fourth industrial revolution. Currently, deep learning techniques, such as convolutional neural networks, enable intelligent information collection in fields such as tongue and pulse diagnosis owing to their robust feature-processing capabilities. Natural language processing models, including long short-term memory and transformers, have been applied to traditional Chinese medicine (TCM) for diagnosis, syndrome differentiation, and prescription generation. Traditional machine learning algorithms, such as neural networks, support vector machines, and random forests, are also widely used in TCM diagnosis and treatment because of their strong regression and classification performance on small structured datasets. Future research on AI in TCM diagnosis and treatment may emphasize building large-scale, high-quality TCM datasets with unified criteria based on syndrome elements; identifying algorithms suited to TCM theoretical data distributions; and leveraging AI multimodal fusion and ensemble learning techniques for diverse raw features, such as images, text, and manually processed structured data, to increase the clinical efficacy of TCM diagnosis and treatment.
Inverse Synthetic Aperture Radar (ISAR) images of complex targets have a low Signal-to-Noise Ratio (SNR) and contain fuzzy edges and large differences in scattering intensity, which limits the recognition performance of ISAR systems. Data scarcity further complicates the accurate recognition of components. To address component recognition in complex ISAR targets, this paper adopts semantic segmentation and proposes a few-shot semantic segmentation framework that fuses multimodal features. The scarcity of available data is mitigated by a two-branch scattering feature encoding structure. High-resolution features are then obtained by fusing ISAR image texture features with the scattering quantization information of complex-valued echoes, thereby achieving significantly higher structural adaptability. A scattering trait enhancement module and a statistical quantification module are also designed. The edge texture is enhanced based on the scatter quantization property, which alleviates the segmentation challenge of edge blurring under low-SNR conditions. The coupling of query and support samples is enhanced through four-dimensional convolution. Additionally, to overcome fusion challenges caused by information differences, multimodal feature fusion is guided by an equilibrium comprehension loss. In this way, the performance potential of the fusion framework is fully exploited and the decision risk is effectively reduced. Experiments demonstrate the advantages of the proposed framework in multimodal feature fusion and show that it retains strong component segmentation capability under low-SNR and edge-blurring conditions.
Accurate prediction of drug responses in cancer cell lines (CCLs) and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine. Despite the rapid advancements in existing computational methods for preclinical and clinical cancer drug response (CDR) prediction, challenges remain regarding generalization to new drugs that are unseen in the training set. Herein, we propose a multimodal fusion deep learning (DL) model called drug-target and single-cell language based CDR (DTLCDR) to predict preclinical and clinical CDRs. The model integrates chemical descriptors, molecular graph representations, predicted protein target profiles of drugs, and cell line expression profiles with general knowledge from single cells. Among these features, a well-trained drug-target interaction (DTI) prediction model is used to generate target profiles of drugs, and a pretrained single-cell language model is integrated to provide general genomic knowledge. Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibited improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods. Further ablation studies verified the effectiveness of each component of our model, highlighting the significant contribution of target information to generalizability. Subsequently, the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments, demonstrating its potential for real-world applications. Moreover, DTLCDR was transferred to the clinical datasets, demonstrating satisfactory performance on the clinical data, regardless of whether the drugs were included in the cell line dataset. Overall, our results suggest that DTLCDR is a promising tool for personalized drug discovery.
Background: Initial diagnosis of lung cancer currently relies on experienced doctors combining imaging and biological indicators, but the uneven distribution of medical resources in China leads to delayed early diagnosis, affecting prognosis. Existing methods struggle with large-scale screening, multitracking, and over-reliance on single-modality data, ignoring the potential of multisource complementary information. Key technical challenges (effective data collection, multimodal feature extraction/fusion, and AI model construction) limit clinical application. Thus, exploring AI, new sensors, and existing data for efficient, fast, accurate, and radiation-free preliminary diagnosis is crucial for timely treatment and improved outcomes. Methods: This study collected hematological data and used fiber-optic vibration sensors and audio sensors to capture heterogeneous signals of patients' lung respiration. Fiber-optic respiratory frequency, audio-respiratory rhythm, and hematological leukocyte-related features were extracted and optimized as multimodal inputs. The SCCA-LMF fusion method generated fusion samples, which were input into an improved stacking ensemble learning model (including SVM, XGBoost, etc.) for binary classification. Results: The experiment included 360 actual samples (lung cancer:non-lung cancer = 3.6:1) with complete data from males and females aged 55-65 years. Predictive accuracy, sensitivity, specificity, and F1 score reached 97.70%, 95.75%, 99.64%, and 99.64%, respectively, outperforming existing independent LMF and TFN methods. This model effectively integrates respiratory vibration, audio signals, and routine blood tests. A multimodal feature grading fusion strategy was designed for 3D data analysis to comprehensively understand patient health and enhance prediction capabilities. All data and results are reproducible. Conclusion: This study demonstrates the method's potential for preliminary identification of lung cancer, bridging medicine and engineering to improve healthcare outcomes.
A hateful meme is a multimodal medium that combines images and text. The potential hate content of such memes has caused serious problems for social media security. The current hateful meme classification task faces significant data scarcity, and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting. In addition, it is challenging to understand the underlying relationship between the text and images in hateful memes. To address these issues, we propose a multimodal hateful meme classification model named LABF, which is based on low-rank adapter layers and bidirectional gated feature fusion. First, low-rank adapter layers are adopted to learn the feature representation of the new dataset. This is achieved by introducing a small number of additional parameters while retaining the prior knowledge of the CLIP model, which effectively alleviates overfitting. Second, a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion. Experimental results show that the method significantly outperforms existing methods on two public datasets, verifying its effectiveness and robustness.
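The two mechanisms named in this abstract can be sketched generically as follows. The abstract does not give LABF's exact equations, so this is an assumption-laden illustration. It shows a frozen projection with a trainable rank-r update (in the style of common low-rank adapters) and a fusion step in which each modality is gated by a signal computed from the other. All sizes, weight names, and the specific gate form are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # feature width and adapter rank (hypothetical sizes)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def low_rank_adapter(x, W_frozen, A, B):
    """Frozen projection plus a trainable rank-r update: x @ (W + A @ B)."""
    return x @ W_frozen + (x @ A) @ B

def bidirectional_gated_fusion(text, image, W_t, W_i):
    """Gate each modality with a signal computed from the other modality."""
    gate_for_text = sigmoid(image @ W_i)   # image features decide how much text passes
    gate_for_image = sigmoid(text @ W_t)   # text features decide how much image passes
    return gate_for_text * text + gate_for_image * image

W_frozen = rng.standard_normal((d, d))   # stands in for a pre-trained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # zero init: the adapter starts as a no-op
text = rng.standard_normal((4, d))
image = rng.standard_normal((4, d))

adapted = low_rank_adapter(text, W_frozen, A, B)
fused = bidirectional_gated_fusion(
    adapted, image, rng.standard_normal((d, d)), rng.standard_normal((d, d))
)
trainable, frozen = A.size + B.size, W_frozen.size  # 64 versus 256 in this toy setup
```

With the zero-initialized `B`, the adapted output equals the frozen model's output at the start of training. This shows how an adapter can retain prior knowledge while adding only a small number of trainable parameters.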
The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos, providing a foundation for intelligent and accurate teaching. However, the complexity of the classroom environment makes student action recognition difficult. This article proposes a lightweight multi-modal fusion action recognition approach for real classroom scenarios in which students are prone to occlusion and computing resources are restricted. The proposed method improves the accuracy of student action recognition while reducing the model's parameter count and computation, thereby achieving more efficient and accurate recognition. In the feature extraction stage, the method fuses the keypoint heatmap with the RGB (red-green-blue) image. To fully exploit the unique information of the different modalities for feature complementarity, a Feature Fusion Module (FFE) is introduced. The FFE encodes and fuses the unique features of the two modalities during feature extraction. This fusion strategy not only achieves fusion and complementarity between modalities but also improves overall model performance. Furthermore, to reduce the computational load and parameter scale of the model, we use keypoint information to crop the RGB images. At the same time, the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features. These measures significantly reduce the computational load and parameter scale. The model has 1.40 million parameters and requires 5.04 GFLOPs (billion floating-point operations), achieving an efficient lightweight design. On the Student Classroom Action Dataset (SCAD), the accuracy of the model is 88.36%. On NTU 60 (the Nanyang Technological University RGB+D dataset with 60 categories), the accuracies on X-Sub (the people in the training set are different from those in the test set) and X-View (the perspectives of the training set and the test set are different) are 95.76% and 98.82%, respectively. On NTU 120 (the Nanyang Technological University RGB+D dataset with 120 categories), the accuracies on X-Sub and X-Set (the perspectives of the training set and the test set are different) are 91.97% and 93.45%, respectively. The model achieves a balance among accuracy, computation, and parameter count.
Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and the modality fusion approach tends to be too simple, often neglecting modality alignment before fusion. This research introduces a novel dual-stream multimodal alignment and fusion network named DMAFNet for classifying short videos. The network uses two unimodal encoder modules to extract features within modalities and a multimodal encoder module to learn the interaction between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video text matching (VTM) auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of loss functions. Diverse experiments demonstrate the effectiveness of DMAFNet in multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
Accurate estimation of lithium battery state-of-health (SOH) is essential for ensuring safe operation and efficient utilization. To address the challenges of complex degradation factors and unreliable feature extraction, we develop a novel SOH prediction model integrating physical information constraints and multimodal feature fusion. Our approach employs multi-channel encoders to process heterogeneous data modalities, including health indicators, raw charge/discharge sequences, and incremental capacity data, and to produce structured input. A physics-informed loss function, derived from an empirical capacity decay equation, is incorporated to enforce interpretability, while a cross-layer attention mechanism dynamically weights features to handle missing modalities and random noise. Experimental validation on multiple battery types demonstrates that our model reduces mean absolute error (MAE) by at least 51.09% compared to unimodal baselines, maintains robustness under adverse conditions such as partial data loss, and achieves an average MAE of 0.0201 in real-world battery pack applications. The model significantly enhances the accuracy and generality of prediction, enabling accurate estimation of battery SOH under actual engineering conditions.
Bone tumors (BTs), including osteosarcoma, Ewing sarcoma, and chondrosarcoma, are rare but biologically complex malignancies characterized by pronounced heterogeneity in anatomical location, histological subtype, and molecular alterations. Recent advances in artificial intelligence (AI), particularly deep learning, have enabled the integration of diverse clinical data modalities to support diagnosis, treatment planning, and prognostication in bone oncology. This review provides a comprehensive synthesis of AI-driven multimodal fusion strategies that incorporate radiological imaging, digital pathology, multi-omics profiling, and electronic health records. We conducted a structured review of peer-reviewed literature published between 2015 and early 2025, focusing on the development, validation, and clinical applicability of AI models for BT diagnosis, subtyping, treatment response prediction, and recurrence monitoring. Although multimodal models have demonstrated advantages over unimodal approaches, especially in handling missing data and improving generalizability, most remain constrained by single-center study designs, small sample sizes, and limited prospective or external validation. Persistent technical and translational challenges include semantic misalignment across modalities, incomplete datasets, limited model interpretability, and regulatory and infrastructural barriers to clinical integration. To address these limitations, we highlight emerging directions such as contrastive representation learning, generative data augmentation, transformer-based fusion architectures, and privacy-preserving federated learning. We also discuss the evolving role of foundation models and workflow-integrated AI agents in enhancing scalability and clinical usability. In summary, multimodal AI represents a promising paradigm for advancing precision care in BTs. Realizing its full clinical potential will require methodologically rigorous, biologically informed, and system-level approaches that bridge algorithmic innovation with real-world healthcare delivery.
Visual question answering (VQA) is a multimodal task that involves a deep understanding of the image scene and the question's meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images, in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; and (2) studying the role of the multimodal bilinear pooling fusion technique in the trade-off between model complexity and overall model performance. Some fusion techniques could significantly increase model complexity, which seriously limits their applicability to VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques in terms of their ability to reduce model complexity and improve model performance for this kind of VQA system. Experiments indicate that these multimodal bilinear pooling fusion techniques improve the VQA model's performance, reaching a best performance of 89.25%. Further, the experiments show that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique shows the best balance between model complexity and performance for VQA systems designed to answer yes/no questions.
Medical image fusion technology is crucial for improving the detection accuracy and treatment efficiency of diseases, but existing fusion methods suffer from blurred texture details, low contrast, and an inability to fully extract fused image information. To address these issues, a multimodal medical image fusion method based on mask optimization and a parallel attention mechanism is proposed. First, the entire image is converted into a binary mask, and a contour feature map is constructed to maximize the contour feature information of the image, together with a triple-path network for extracting and optimizing image texture detail features. Second, a contrast enhancement module and a detail preservation module are proposed to enhance the overall brightness and texture details of the image. Next, a parallel attention mechanism is constructed using channel features and spatial feature changes to fuse images and enhance the salient information of the fused images. Finally, a decoupling network composed of residual networks is set up to optimize the information between the fused image and the source image so as to reduce information loss in the fused image. Compared with nine high-level methods proposed in recent years, the seven objective evaluation indicators of our method improve by 6% to 31%, indicating that the method can obtain fusion results with clearer texture details, higher contrast, and smaller pixel differences between the fused image and the source image. It is superior to the other comparison algorithms in both subjective and objective indicators.
Bird's-eye-view (BEV) perception is a core technology for autonomous driving systems. However, existing solutions face the dilemma of high costs associated with multimodal methods and limited performance of vision-only approaches. To address this issue, this paper proposes a framework named "a lightweight pure visual BEV perception method based on dual distillation of spatial-temporal knowledge". This framework designs a lightweight vision-only student model based on ResNet, which leverages a dual distillation mechanism to learn from a powerful teacher model that integrates temporal information from both image and light detection and ranging (LiDAR) modalities. Specifically, we distill efficient multi-modal feature extraction and spatial fusion capabilities from the BEVFusion model, and distill advanced temporal information fusion and spatiotemporal attention mechanisms from the BEVFormer model. This dual distillation strategy enables the student model to achieve perception performance close to that of multi-modal models without relying on LiDAR. Experimental results on the nuScenes dataset demonstrate that the proposed model significantly outperforms classical vision-only algorithms, achieves performance comparable to current state-of-the-art vision-only methods on the nuScenes detection leaderboard in terms of both mean average precision (mAP) and the nuScenes detection score (NDS), and exhibits notable advantages in inference computational efficiency. Although the proposed dual-teacher paradigm incurs higher offline training costs than single-model approaches, it yields a streamlined and highly efficient student model suitable for resource-constrained real-time deployment. This provides an effective pathway toward low-cost, high-performance autonomous driving perception systems.
Spectrum sensing is an indispensable core part of cognitive radio dynamic spectrum access (DSA) and a key approach to alleviating spectrum scarcity in the Internet of Things (IoT). The key issue in practical IoT networks is robust sensing under the coexistence of low signal-to-noise ratios (SNRs) and non-Gaussian impulsive noise, where observations may be distorted differently across feature modalities, making conventional fusion unstable and degrading detection reliability. To address this challenge, the generalized Gaussian distribution (GGD) is adopted as the noise model, and a multimodal fusion framework termed BCAM-Net (bidirectional cross-attention multimodal network) is proposed. BCAM-Net adopts a parallel dual-branch architecture: a time-frequency branch that leverages the continuous wavelet transform (CWT) to extract time-frequency representations, and a temporal branch that learns long-range dependencies from raw signals. BCAM-Net uses a bidirectional cross-attention mechanism to achieve deep alignment and mutual calibration of temporal and time-frequency features, generating a fused representation that is highly robust to complex noise. Simulation results show that, under GGD noise with shape parameter β = 0.5, BCAM-Net achieves high detection probabilities in the low-SNR regime and outperforms representative baselines. At a false alarm probability Pf = 0.1 and an SNR of −14 dB, it attains a detection probability of 0.9020, exceeding the CNN-Transformer, WT-ResNet, TFCFN, and conventional CNN benchmarks by 5.75%, 6.98%, 33.3%, and 21.1%, respectively. These results indicate that BCAM-Net can effectively improve spectrum sensing performance in low-SNR impulsive-noise scenarios and provides a lightweight, high-performance solution for practical cognitive radio spectrum sensing.
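The noise setting described in this abstract can be reproduced in simulation with a short sketch. It assumes the common GGD parameterization with density proportional to exp(-(|x|/α)^β), under which (|x|/α)^β follows a Gamma(1/β, 1) distribution. The function names, the sinusoidal test signal, and the sample sizes are hypothetical and are not taken from the BCAM-Net paper.

```python
import numpy as np

def generalized_gaussian_noise(size, beta, alpha=1.0, rng=None):
    """Sample GGD noise with shape beta and scale alpha via a Gamma transform."""
    rng = np.random.default_rng() if rng is None else rng
    # If G ~ Gamma(1/beta, 1), then alpha * G**(1/beta) has the GGD magnitude law.
    magnitude = alpha * rng.gamma(shape=1.0 / beta, scale=1.0, size=size) ** (1.0 / beta)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

def scale_noise_to_snr(signal, noise, snr_db):
    """Rescale noise so that the empirical signal-to-noise ratio equals snr_db."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 0.05 * np.arange(4096))           # hypothetical test signal
impulsive = generalized_gaussian_noise(4096, beta=0.5, rng=rng)
noise = scale_noise_to_snr(signal, impulsive, snr_db=-14.0)
received = signal + noise
measured_snr_db = 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
```

Setting β = 2 recovers Gaussian noise (with variance α²/2), and β = 1 gives Laplacian noise. Values below 1, such as the β = 0.5 used in the abstract, produce heavier tails and therefore more impulsive samples.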
Deep learning-based methods have shown great potential in intelligent bearing fault diagnosis. However, most existing approaches suffer from the scarcity of labeled data, which often results in insufficient robustness under complex working conditions and a general lack of interpretability. To address these challenges, we propose a physics-informed multimodal fault diagnosis framework based on few-shot learning, which integrates a 2D time-frequency image encoder and a 1D vibration signal encoder. Specifically, we embed prior knowledge of multi-resolution analysis from signal processing into the model by designing a Laplace Wavelet Convolution (LWC) module, which enhances interpretability since wavelet coefficients naturally correspond to specific frequency and temporal structures. To further balance the guidance of physical priors with the flexibility of learnable representations, we introduce a parametric multi-kernel wavelet that employs channel-wise dynamic attention to adaptively select relevant wavelet bases, thereby improving the feature expressiveness. Moreover, we develop a Mahalanobis-Prototype Joint Metric, which constructs more accurate and distribution-consistent decision boundaries under few-shot conditions. Comprehensive experiments on the Case Western Reserve University (CWRU) and Paderborn University (PU) bearing datasets demonstrate the superior effectiveness, robustness, and interpretability of the proposed approach compared with state-of-the-art baselines.
By 2025, research on Traditional Chinese Medicine (TCM) meridians has generated 12-15 macro-level theories and over 20 specific hypotheses, manifesting a highly fragmented research landscape. Objective: This paper proposes the “Holistic Hierarchical Predictive-Integration Hypothesis” (HHPIT) to construct a unified theoretical framework that integrates the rational components of existing meridian hypotheses. Methods: The HHPIT hypothesis systematically reviews current meridian theories, employs interdisciplinary methodologies, integrates artificial intelligence technology, and establishes a three-tier architecture encompassing structural, functional, and systemic layers. Results: HHPIT successfully integrates diverse meridian theories, proposes a computable algorithmic pipeline, and provides specific application protocols for chronic disease treatment, anti-aging, and enhancement of Zang-fu organ functions. Conclusion: HHPIT offers a novel, computable, and verifiable research paradigm for meridian studies, promoting the modernization and internationalization of TCM theory.
Abstract: Background: Medical imaging advancements are constrained by fundamental trade-offs between acquisition speed, radiation dose, and image quality, forcing clinicians to work with noisy, incomplete data. Existing reconstruction methods either compromise on accuracy with iterative algorithms or suffer from limited generalizability with task-specific deep learning approaches. Methods: We present LDM-PIR, a lightweight physics-conditioned diffusion multi-model for medical image reconstruction that addresses key challenges in magnetic resonance imaging (MRI), CT, and low-photon imaging. Unlike traditional iterative methods, which are computationally expensive, or task-specific deep learning approaches, which lack generalizability, LDM-PIR integrates three innovations: (1) a physics-conditioned diffusion framework that embeds acquisition operators (Fourier/Radon transforms) and noise models directly into the reconstruction process; (2) a multi-model architecture that unifies denoising, inpainting, and super-resolution via shared weight conditioning; and (3) a lightweight design (2.1M parameters) enabling rapid inference (0.8 s/image on GPU). Through self-supervised fine-tuning with measurement consistency losses, LDM-PIR adapts to new imaging modalities using fewer annotated samples. Results: LDM-PIR achieves state-of-the-art performance on fastMRI (peak signal-to-noise ratio (PSNR): 34.04 for single-coil/31.50 for multi-coil) and the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset (28.83 PSNR under Poisson noise). Clinical evaluations demonstrate superior preservation of anatomical structures, with SSIM improvements of 8.8% for single-coil and 4.36% for multi-coil MRI over uDPIR. Conclusion: LDM-PIR offers a flexible, efficient, and scalable solution for medical image reconstruction, addressing the challenges of noise, undersampling, and modality generalization. The model’s lightweight design allows for rapid inference, while its self-supervised fine-tuning capability minimizes reliance on large annotated datasets, making it suitable for real-world clinical applications.
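The measurement consistency idea mentioned in this abstract can be illustrated with a toy undersampled-Fourier example. The sketch below is not the LDM-PIR implementation: it assumes a 1D signal, a unitary DFT as the acquisition operator, and a binary sampling mask, and all function names are hypothetical. It only shows why such a loss needs no ground-truth image, since it compares the re-measured estimate with the acquired data alone.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Unitary 1D discrete Fourier transform (O(n^2), fine for a toy example)."""
    n = len(x)
    scale = 1.0 / n ** 0.5
    return [
        scale * sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def measure(x: Sequence[complex], mask: Sequence[int]) -> List[complex]:
    """Forward operator A = M F: keep only the sampled k-space entries."""
    return [v if keep else 0j for v, keep in zip(dft(x), mask)]


def consistency_loss(x_hat: Sequence[complex], y: Sequence[complex],
                     mask: Sequence[int]) -> float:
    """Mean squared residual |M F x_hat - y|^2 over the sampled entries."""
    residual = [a - b for a, b in zip(measure(x_hat, mask), y)]
    return sum(abs(r) ** 2 for r in residual) / sum(mask)


if __name__ == "__main__":
    x_true = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, 1.5, -0.5]
    mask = [1, 1, 0, 0, 1, 0, 0, 1]           # half of k-space is acquired
    y = measure(x_true, mask)                  # simulated measurements
    x_shifted = [v + 0.1 for v in x_true]      # estimate with a wrong mean
    print(consistency_loss(x_true, y, mask))
    print(consistency_loss(x_shifted, y, mask))
```

For CT, the same pattern would apply with a Radon-type projector in place of the DFT; that substitution is left out here to keep the example dependency-free.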
Funding: Supported via funding from Prince Sattam bin Abdulaziz University, project number (PSAU/2025/03/32440).
Abstract: Parkinson’s disease remains a major clinical issue in terms of early detection, especially during its prodromal stage, when symptoms are not evident or not distinct. To address this problem, we propose a new deep learning-based approach for detecting Parkinson’s disease during the prodromal stage, before overt symptoms develop. We used five publicly accessible datasets, including UCI Parkinson’s Voice, Spiral Drawings, PaHaW, NewHandPD, and PPMI, and implemented a dual-stream CNN–BiLSTM architecture with Fisher-weighted feature merging and SHAP-based explanation. The model achieved an accuracy of 98.2%, an F1-score of 0.981, and an AUC of 0.991 on the UCI Voice dataset. Its performance on the remaining datasets was comparable, with accuracy up to 2–7 percent higher than existing strong models such as CNN–RNN–MLP, ILN–GNet, and CASENet. Overall, the findings support the diagnostic promise of micro-tremor assessment and demonstrate that combining temporal and spatial features with a scatter-based segment in a multi-modal approach can be an effective and scalable platform for an “early,” interpretable PD screening system.
Funding: Funded by the National Natural Science Foundation of China under Grant 62306128, the Leading Innovation Project of Changzhou Science and Technology Bureau under Grant CQ20230072, the Basic Science Research Project of Jiangsu Provincial Department of Education under Grant 23KJD520003, the Science and Technology Development Plan Project of Jilin Province under Grant 20240101382JC, and the National Key Research and Development Program of China under Grant 2023YFF1105102.
Abstract: In fire rescue scenarios, traditional manual operations are highly dangerous, as dense smoke, low visibility, extreme heat, and toxic gases not only hinder rescue efficiency but also endanger firefighters’ safety. Although intelligent rescue robots can enter hazardous environments in place of humans, smoke poses major challenges for human detection algorithms. These challenges include the attenuation of visible and infrared signals, complex thermal fields, and interference from background objects, all of which make it difficult to accurately identify trapped individuals. To address this problem, we propose VIF-YOLO, a visible–infrared fusion model for real-time human detection in dense smoke environments. The framework introduces a lightweight multimodal fusion (LMF) module based on learnable low-rank representation blocks to integrate visible and infrared images end to end, preserving fine details while enhancing salient features. In addition, an efficient multiscale attention (EMA) mechanism is incorporated into the YOLOv10n backbone to improve feature representation under low-light conditions. Extensive experiments on our newly constructed multimodal smoke human detection (MSHD) dataset demonstrate that VIF-YOLO achieves mAP50 of 99.5%, precision of 99.2%, and recall of 99.3%, outperforming YOLOv10n by a clear margin. Furthermore, when deployed on the NVIDIA Jetson Xavier NX, VIF-YOLO attains 40.6 FPS with an average inference latency of 24.6 ms, validating its real-time capability on edge-computing platforms. These results confirm that VIF-YOLO provides accurate, robust, and fast detection across complex backgrounds and diverse smoke conditions, ensuring reliable and rapid localization of individuals in need of rescue.
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2022YFC3004104), the National Natural Science Foundation of China (Grant No. U2342204), the Innovation and Development Program of the China Meteorological Administration (Grant No. CXFZ2024J001), the Open Research Project of the Key Open Laboratory of Hydrology and Meteorology of the China Meteorological Administration (Grant No. 23SWQXZ010), the Science and Technology Plan Project of Zhejiang Province (Grant No. 2022C03150), the Open Research Fund Project of Anyang National Climate Observatory (Grant No. AYNCOF202401), and the Open Bidding for Selecting the Best Candidates Program (Grant No. CMAJBGS202318).
Abstract: Thunderstorm wind gusts are small in scale, typically occurring within a range of a few kilometers. It is extremely challenging to monitor and forecast thunderstorm wind gusts using only automatic weather stations. Therefore, it is necessary to establish thunderstorm wind gust identification techniques based on multisource high-resolution observations. This paper introduces a new algorithm, called thunderstorm wind gust identification network (TGNet). It leverages multimodal feature fusion to fuse the temporal and spatial features of thunderstorm wind gust events. The shapelet transform is first used to extract the temporal features of wind speeds from automatic weather stations, with the aim of distinguishing thunderstorm wind gusts from those caused by synoptic-scale systems or typhoons. Then, an encoder built on the U-shaped network (U-Net) and incorporating recurrent residual convolutional blocks (R2U-Net) is employed to extract the corresponding spatial convective characteristics from satellite, radar, and lightning observations. Finally, a multimodal deep fusion module based on multi-head cross-attention incorporates the temporal features of wind speed at each automatic weather station into the spatial features to obtain a classification of thunderstorm wind gusts every 10 minutes. TGNet products have high accuracy, with a critical success index reaching 0.77. Compared with those of U-Net and R2U-Net, the false alarm rate of TGNet products decreases by 31.28% and 24.15%, respectively. The new algorithm provides grid products of thunderstorm wind gusts with a spatial resolution of 0.01°, updated every 10 minutes. The results are finer and more accurate, thereby helping to improve the accuracy of operational warnings for thunderstorm wind gusts.
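The verification scores quoted in this abstract follow from a standard contingency table of hits, misses, and false alarms. The sketch below uses the usual forecast-verification definitions; the counts are made-up illustrations, and whether the reported decreases are relative (as computed here) or absolute percentage points is not stated in the abstract, so that part is an assumption.

```python
def critical_success_index(hits: int, misses: int, false_alarms: int) -> float:
    """CSI = hits / (hits + misses + false alarms)."""
    return hits / (hits + misses + false_alarms)


def false_alarm_ratio(hits: int, false_alarms: int) -> float:
    """FAR = false alarms / (hits + false alarms)."""
    return false_alarms / (hits + false_alarms)


def relative_decrease(baseline: float, new: float) -> float:
    """Fractional reduction of a score relative to a baseline value."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    print(critical_success_index(hits=77, misses=13, false_alarms=10))
    print(false_alarm_ratio(hits=77, false_alarms=10))
    print(relative_decrease(baseline=0.20, new=0.15))
```

Correct negatives do not enter either score, which is why CSI is often preferred for rare events such as thunderstorm wind gusts.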
Funding: Supported by grants from the National Natural Science Foundation of China (Key Program) (No. 82230124), the Traditional Chinese Medicine Inheritance and Innovation “Ten million” talent project-Qihuang Project Chief Scientist Project (No. 0201000401), the State Administration of Traditional Chinese Medicine 2nd National Traditional Chinese Medicine Inheritance Studio Construction Project (Official Letter of the State Office of Traditional Chinese Medicine [2022] No. 245), and the National Natural Science Foundation of China (General Program) (No. 81974556).
Abstract: Artificial intelligence (AI) serves as a key technology in global industrial transformation and technological restructuring and as the core driver of the fourth industrial revolution. Currently, deep learning techniques, such as convolutional neural networks, enable intelligent information collection in fields such as tongue and pulse diagnosis owing to their robust feature-processing capabilities. Natural language processing models, including long short-term memory and transformers, have been applied to traditional Chinese medicine (TCM) for diagnosis, syndrome differentiation, and prescription generation. Traditional machine learning algorithms, such as neural networks, support vector machines, and random forests, are also widely used in TCM diagnosis and treatment because of their strong regression and classification performance on small structured datasets. Future research on AI in TCM diagnosis and treatment may emphasize building large-scale, high-quality TCM datasets with unified criteria based on syndrome elements; identifying algorithms suited to TCM theoretical data distributions; and leveraging AI multimodal fusion and ensemble learning techniques for diverse raw features, such as images, text, and manually processed structured data, to increase the clinical efficacy of TCM diagnosis and treatment.
Abstract: Inverse Synthetic Aperture Radar (ISAR) images of complex targets have a low Signal-to-Noise Ratio (SNR) and contain fuzzy edges and large differences in scattering intensity, which limits the recognition performance of ISAR systems. Also, data scarcity poses a greater challenge to the accurate recognition of components. To address the issues of component recognition in complex ISAR targets, this paper adopts semantic segmentation and proposes a few-shot semantic segmentation framework fusing multimodal features. The scarcity of available data is mitigated by using a two-branch scattering feature encoding structure. Then, the high-resolution features are obtained by fusing the ISAR image texture features and scattering quantization information of complex-valued echoes, thereby achieving significantly higher structural adaptability. Meanwhile, the scattering trait enhancement module and the statistical quantification module are designed. The edge texture is enhanced based on the scatter quantization property, which alleviates the segmentation challenge of edge blurring under low SNR conditions. The coupling of query/support samples is enhanced through four-dimensional convolution. Additionally, to overcome fusion challenges caused by information differences, multimodal feature fusion is guided by equilibrium comprehension loss. In this way, the performance potential of the fusion framework is fully unleashed, and the decision risk is effectively reduced. Experiments demonstrate the great advantages of the proposed framework in multimodal feature fusion, and it still exhibits great component segmentation capability under low SNR/edge blurring conditions.
Funding: Supported by the National Key Research and Development Program of China (Grant No.: 2023YFC2605002), the National Key R&D Program of China (Grant No.: 2022YFF1203003), the Beijing AI Health Cultivation Project, China (Grant No.: Z221100003522022), the National Natural Science Foundation of China (Grant No.: 82273772), and the Beijing Natural Science Foundation, China (Grant No.: 7212152).
Abstract: Accurate prediction of drug responses in cancer cell lines (CCLs) and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine. Despite the rapid advancements in existing computational methods for preclinical and clinical cancer drug response (CDR) prediction, challenges remain regarding generalization to new drugs that are unseen in the training set. Herein, we propose a multimodal fusion deep learning (DL) model called drug-target and single-cell language based CDR (DTLCDR) to predict preclinical and clinical CDRs. The model integrates chemical descriptors, molecular graph representations, predicted protein target profiles of drugs, and cell line expression profiles with general knowledge from single cells. Among these features, a well-trained drug-target interaction (DTI) prediction model is used to generate target profiles of drugs, and a pretrained single-cell language model is integrated to provide general genomic knowledge. Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibited improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods. Further ablation studies verified the effectiveness of each component of our model, highlighting the significant contribution of target information to generalizability. Subsequently, the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments, demonstrating its potential for real-world applications. Moreover, DTLCDR was transferred to clinical datasets, demonstrating satisfactory performance on the clinical data regardless of whether the drugs were included in the cell line dataset. Overall, our results suggest that DTLCDR is a promising tool for personalized drug discovery.
Funding: The Natural Science Foundation of Gansu Province (No. 20JR10RA614, 22YF7GA182, 22JR11RA042, 22JR5RA1006); the National Natural Science Foundation of Gansu Province (No. 24CXGA024); the Industrial Support Plan for Higher Education Institutions in Gansu Province (No. CYZC-2024-10); the Open Fund of Key Laboratory of Time and Frequency Primary Standards, CAS; the Gansu Provincial University Industry Support Plan Project (2022CYZC-072022); the Lanzhou Chengguan District Science and Technology Plan Project (2021RCCX0031); the Lanzhou Science and Technology Program (No. 2024-4-38).
Abstract: Background: Current initial diagnosis of lung cancer relies on experienced doctors combining imaging and biological indicators, but uneven distribution of medical resources in China leads to delayed early diagnosis, affecting prognosis. Existing methods struggle with large-scale screening and multitracking, and over-rely on single-modality data, ignoring the potential of multisource complementary information. Key technical challenges (effective data collection, multimodal feature extraction/fusion, and AI model construction) limit clinical application. Thus, exploring AI, new sensors, and existing data for efficient, fast, accurate, and radiation-free preliminary diagnosis is crucial for timely treatment and improved outcomes. Methods: This study collected hematological data and used fiber-optic vibration sensors and audio sensors to capture heterogeneous signals of patients' lung respiration. Fiber-optic respiratory frequency, audio-respiratory rhythm, and hematological leukocyte-related features were extracted and optimized as multimodal inputs. The SCCA-LMF fusion method generated fusion samples, which were input into an improved stacking ensemble learning model (including SVM, XGBoost, etc.) for binary classification. Results: The experiment included 360 actual samples (lung cancer : non-lung cancer = 3.6 : 1) with complete data from 55-65-year-old males and females. Predictive accuracy, sensitivity, specificity, and F1 score reached 97.70%, 95.75%, 99.64%, and 99.64%, respectively, outperforming existing independent LMF and TFN methods. This model effectively integrates respiratory vibration, audio signals, and routine blood tests. A multimodal feature grading fusion strategy was designed for 3D data analysis to comprehensively understand patient health and enhance prediction capabilities. All data and results are reproducible. Conclusion: This study demonstrates the method's potential for preliminary identification of lung cancer, bridging medicine and engineering to improve healthcare outcomes.
Funding: Supported by the Funding for Research on the Evolution of Cyberbullying Incidents and Intervention Strategies (24BSH033) and Discipline Innovation and Talent Introduction Bases in Higher Education Institutions (B20087).
Abstract: A hateful meme is a multimodal medium that combines images and text. The potentially hateful content of such memes has caused serious problems for social media security. The hateful memes classification task currently faces significant data scarcity, and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting. In addition, it is challenging to understand the underlying relationship between text and images in hateful memes. To address these issues, we propose a multimodal hateful memes classification model named LABF, which is based on low-rank adapter layers and bidirectional gated feature fusion. Firstly, low-rank adapter layers are adopted to learn the feature representation of the new dataset. This is achieved by introducing a small number of additional parameters while retaining the prior knowledge of the CLIP model, which effectively alleviates overfitting. Secondly, a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion. Experimental results show that the method significantly outperforms existing methods on two public datasets, verifying its effectiveness and robustness.
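The parameter-saving argument behind low-rank adapter layers can be sketched as follows. This is one common reading of the term (a LoRA-style additive update on a frozen weight matrix) and is an assumption here, since the abstract does not specify how LABF's adapters are wired into CLIP; the class and method names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LowRankAdapter:
    """Frozen weight W plus a trainable low-rank update scale * (B @ A)."""

    def __init__(self, weight: Matrix, rank: int, scale: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                      # pretrained, kept frozen
        # A (rank x d_in): small random init; B (d_out x rank): zero init,
        # so the adapted layer starts out identical to the pretrained one.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.up = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_parameters(self) -> int:
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


if __name__ == "__main__":
    w = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
    layer = LowRankAdapter(w, rank=1)
    print(layer.forward([1.0, 2.0, 3.0]), layer.trainable_parameters())
```

For a d_out × d_in layer the adapter trains rank × (d_in + d_out) values instead of d_in × d_out, which is the sense in which only a small number of additional parameters is introduced.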
Funding: Supported by the National Natural Science Foundation of China under Grant 62107034, the Major Science and Technology Project of Yunnan Province (202402AD080002), and the Yunnan International Joint R&D Center of China-Laos-Thailand Educational Digitalization (202203AP140006).
Abstract: The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos, providing a foundation for intelligent and accurate teaching. However, the complexity of the classroom environment makes student action recognition difficult. This article proposes a lightweight multi-modal fusion action recognition approach for real classroom scenarios in which students are prone to occlusion and computing resources are restricted. The method improves the accuracy of student action recognition while reducing the number of model parameters and the computation amount, achieving more efficient and accurate recognition. In the feature extraction stage, the method fuses the keypoint heatmap with the RGB (red-green-blue color model) image. To fully utilize the unique information of the different modalities for feature complementarity, a Feature Fusion Module (FFE) is introduced. The FFE encodes and fuses the unique features of the two modalities during feature extraction. This fusion strategy not only achieves fusion and complementarity between modalities but also improves overall model performance. Furthermore, to reduce the computational load and parameter scale of the model, we use keypoint information to crop the RGB images. At the same time, the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features. These methods significantly reduce the computational load and parameter scale. The model has 1.40 million parameters and a computation amount of 5.04 GFLOPs (billion floating-point operations), achieving an efficient lightweight design. On the Student Classroom Action Dataset (SCAD), the accuracy of the model is 88.36%. On NTU 60 (the Nanyang Technological University Red-Green-Blue-Depth, RGB+D, dataset with 60 categories), the accuracies on X-Sub (the people in the training set are different from those in the test set) and X-View (the perspectives of the training set and the test set are different) are 95.76% and 98.82%, respectively. On NTU 120 (the Nanyang Technological University RGB+D dataset with 120 categories), the accuracies on X-Sub and X-Set (the perspectives of the training set and the test set are different) are 91.97% and 93.45%, respectively. The model achieves a balance among accuracy, computation amount, and number of parameters.
Funding: Fundamental Research Funds for the Central Universities, China (No. 2232021A-10); National Natural Science Foundation of China (No. 61903078); Shanghai Sailing Program, China (No. 22YF1401300); Natural Science Foundation of Shanghai, China (No. 20ZR1400400).
Abstract: Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and the modality fusion approach tends to be too simple, often neglecting modality alignment before fusion. This research introduces a novel dual-stream multimodal alignment and fusion network named DMAFNet for classifying short videos. The network uses two unimodal encoder modules to extract features within modalities and a multimodal encoder module to learn interactions between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video text matching (VTM) auxiliary tasks are introduced to improve the interaction between the video frame and text modalities through backpropagation of their loss functions. Diverse experiments demonstrate the efficiency of DMAFNet in multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
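The contrastive alignment step described in this abstract can be sketched with a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. This is a generic formulation rather than DMAFNet's exact objective; the temperature value and function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_alignment_loss(video: List[List[float]], text: List[List[float]],
                               temperature: float = 0.07) -> float:
    """Symmetric video-to-text and text-to-video contrastive loss."""
    v = [normalize(x) for x in video]
    t = [normalize(x) for x in text]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text side
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_alignment_loss(video, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs
    print(contrastive_alignment_loss(video, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

Matched pairs yield a loss near zero while swapped pairs are penalized heavily, which is the pressure that pulls the two unimodal encoders into a shared space before fusion.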
Funding: Project (2023YFB2303704-07) supported by the National Natural Science Foundation of China.
Abstract: Accurate estimation of lithium battery state-of-health (SOH) is essential for ensuring safe operation and efficient utilization. To address the challenges of complex degradation factors and unreliable feature extraction, we develop a novel SOH prediction model integrating physical information constraints and multimodal feature fusion. Our approach employs a multi-channel encoder to process heterogeneous data modalities, including health indicators, raw charge/discharge sequences, and incremental capacity data, and thereby obtain a structured input. A physics-informed loss function, derived from an empirical capacity decay equation, is incorporated to enforce interpretability, while a cross-layer attention mechanism dynamically weights features to handle missing modalities and random noise. Experimental validation on multiple battery types demonstrates that our model reduces mean absolute error (MAE) by at least 51.09% compared to unimodal baselines, maintains robustness under adverse conditions such as partial data loss, and achieves an average MAE of 0.0201 in real-world battery pack applications. The model significantly enhances the accuracy and universality of prediction, enabling accurate prediction of battery SOH under actual engineering conditions.
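The general shape of a physics-informed loss like the one described in this abstract can be sketched as a data term plus a penalty for deviating from an empirical decay curve. The abstract does not give the decay equation, so the power-law form SOH(k) = 1 - a * k^b, its coefficients, and the weighting below are purely hypothetical placeholders.

```python
from typing import Sequence


def empirical_decay(cycle: float, a: float = 1e-3, b: float = 0.5) -> float:
    """Hypothetical capacity-fade law: SOH(k) = 1 - a * k**b (an assumption)."""
    return 1.0 - a * cycle ** b


def physics_informed_loss(pred: Sequence[float], target: Sequence[float],
                          cycles: Sequence[float], weight: float = 0.1,
                          a: float = 1e-3, b: float = 0.5) -> float:
    """MAE on labelled SOH plus a weighted squared deviation from the decay law."""
    n = len(pred)
    data_term = sum(abs(p - t) for p, t in zip(pred, target)) / n
    physics_term = sum((p - empirical_decay(k, a, b)) ** 2
                       for p, k in zip(pred, cycles)) / n
    return data_term + weight * physics_term


if __name__ == "__main__":
    cycles = [0.0, 100.0, 400.0]
    on_curve = [empirical_decay(k) for k in cycles]
    off_curve = [1.0, 1.0, 1.0]  # predicts no degradation at all
    print(physics_informed_loss(on_curve, on_curve, cycles))
    print(physics_informed_loss(off_curve, off_curve, cycles))
```

The second call shows the point of the extra term: a prediction that matches its labels exactly is still penalized when it contradicts the assumed degradation trend.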
Funding: Supported by the National Natural Science Foundation of China [Grant No.: 82172524] and the Natural Science Foundation of Hubei Province [Grant No.: 2025AFB240].
Abstract: Bone tumors (BTs), including osteosarcoma, Ewing sarcoma, and chondrosarcoma, are rare but biologically complex malignancies characterized by pronounced heterogeneity in anatomical location, histological subtype, and molecular alterations. Recent advances in artificial intelligence (AI), particularly deep learning, have enabled the integration of diverse clinical data modalities to support diagnosis, treatment planning, and prognostication in bone oncology. This review provides a comprehensive synthesis of AI-driven multimodal fusion strategies that incorporate radiological imaging, digital pathology, multi-omics profiling, and electronic health records. We conducted a structured review of peer-reviewed literature published between 2015 and early 2025, focusing on the development, validation, and clinical applicability of AI models for BT diagnosis, subtyping, treatment response prediction, and recurrence monitoring. Although multimodal models have demonstrated advantages over unimodal approaches, especially in handling missing data and improving generalizability, most remain constrained by single-center study designs, small sample sizes, and limited prospective or external validation. Persistent technical and translational challenges include semantic misalignment across modalities, incomplete datasets, limited model interpretability, and regulatory and infrastructural barriers to clinical integration. To address these limitations, we highlight emerging directions such as contrastive representation learning, generative data augmentation, transformer-based fusion architectures, and privacy-preserving federated learning. We also discuss the evolving role of foundation models and workflow-integrated AI agents in enhancing scalability and clinical usability. In summary, multimodal AI represents a promising paradigm for advancing precision care in BTs. Realizing its full clinical potential will require methodologically rigorous, biologically informed, and system-level approaches that bridge algorithmic innovation with real-world healthcare delivery.
Abstract: Visual question answering (VQA) is a multimodal task that involves a deep understanding of the image scene and the question’s meaning, and capturing the relevant correlations between both modalities to infer the appropriate answer. In this paper, we propose a VQA system intended to answer yes/no questions about real-world images in Arabic. To support a robust VQA system, we work in two directions: (1) using deep neural networks, namely ResNet-152 and Gated Recurrent Units (GRU), to semantically represent the given image and question in a fine-grained manner; (2) studying the role of the utilized multimodal bilinear pooling fusion technique in the trade-off between the model complexity and the overall model performance. Some fusion techniques could significantly increase the model complexity, which seriously limits their applicability for VQA models. So far, there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions. Hence, a comparative analysis is conducted between eight bilinear pooling fusion techniques in terms of their ability to reduce the model complexity and improve the model performance in this case of VQA systems. Experiments indicate that these multimodal bilinear pooling fusion techniques improved the VQA model’s performance, reaching a best performance of 89.25%. Further, experiments show that the number of answers in the developed VQA system is a critical factor that affects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity. The Multimodal Local Perception Bilinear Pooling (MLPB) technique showed the best balance between the model complexity and its performance for VQA systems designed to answer yes/no questions.
Funding: Supported by the Gansu Natural Science Foundation Programme (No. 24JRRA231), the National Natural Science Foundation of China (No. 62061023), and Gansu Provincial Education, Science and Technology Innovation and Industry (No. 2021CYZC-04).
Abstract: Medical image fusion technology is crucial for improving the detection accuracy and treatment efficiency of diseases, but existing fusion methods have problems such as blurred texture details, low contrast, and inability to fully extract fused image information. Therefore, a multimodal medical image fusion method based on mask optimization and parallel attention mechanism was proposed to address the aforementioned issues. Firstly, it converted the entire image into a binary mask, and constructed a contour feature map to maximize the contour feature information of the image and a triple path network for image texture detail feature extraction and optimization. Secondly, a contrast enhancement module and a detail preservation module were proposed to enhance the overall brightness and texture details of the image. Afterwards, a parallel attention mechanism was constructed using channel features and spatial feature changes to fuse images and enhance the salient information of the fused images. Finally, a decoupling network composed of residual networks was set up to optimize the information between the fused image and the source image so as to reduce information loss in the fused image. Compared with nine high-level methods proposed in recent years, the seven objective evaluation indicators of our method have improved by 6%−31%, indicating that this method can obtain fusion results with clearer texture details, higher contrast, and smaller pixel differences between the fused image and the source image. It is superior to other comparison algorithms in both subjective and objective indicators.
Funding: Supported by the National Natural Science Foundation of China (42476084, 62203456, 42276199), the Stable Support Project of National Key Laboratory (WDZC 20245250302), and the National Key R&D Program of China (2024YFC2813502, 2024YFC2813302).
Abstract: Bird's-eye-view (BEV) perception is a core technology for autonomous driving systems. However, existing solutions face a dilemma between the high cost of multimodal methods and the limited performance of vision-only approaches. To address this issue, this paper proposes a framework named “a lightweight pure visual BEV perception method based on dual distillation of spatial-temporal knowledge”. The framework designs a lightweight vision-only student model based on ResNet, which uses a dual distillation mechanism to learn from a powerful teacher model that integrates temporal information from both image and light detection and ranging (LiDAR) modalities. Specifically, we distill efficient multi-modal feature extraction and spatial fusion capabilities from the BEVFusion model, and distill advanced temporal information fusion and spatiotemporal attention mechanisms from the BEVFormer model. This dual distillation strategy enables the student model to achieve perception performance close to that of multi-modal models without relying on LiDAR. Experimental results on the nuScenes dataset demonstrate that the proposed model significantly outperforms classical vision-only algorithms, achieves performance comparable to current state-of-the-art vision-only methods on the nuScenes detection leaderboard in terms of both mean average precision (mAP) and the nuScenes detection score (NDS), and exhibits notable advantages in inference computational efficiency. Although the proposed dual-teacher paradigm incurs higher offline training costs than single-model approaches, it yields a streamlined and highly efficient student model suitable for resource-constrained real-time deployment. This provides an effective pathway toward low-cost, high-performance autonomous driving perception systems.
Funding: Supported in part by JSPS Grants-in-Aid for Scientific Research 25K07742 and 25K23457.
Abstract: Spectrum sensing is an indispensable core part of cognitive radio dynamic spectrum access (DSA) and a key approach to alleviating spectrum scarcity in the Internet of Things (IoT). The key issue in practical IoT networks is robust sensing under the coexistence of low signal-to-noise ratios (SNRs) and non-Gaussian impulsive noise, where observations may be distorted differently across feature modalities, making conventional fusion unstable and degrading detection reliability. To address this challenge, the generalized Gaussian distribution (GGD) is adopted as the noise model, and a multimodal fusion framework termed BCAM-Net (bidirectional cross-attention multimodal network) is proposed. BCAM-Net adopts a parallel dual-branch architecture: a time-frequency branch that uses the continuous wavelet transform (CWT) to extract time-frequency representations, and a temporal branch that learns long-range dependencies from raw signals. BCAM-Net uses a bidirectional cross-attention mechanism to achieve deep alignment and mutual calibration of temporal and time-frequency features, generating a fused representation that is highly robust to complex noise. Simulation results show that, under GGD noise with shape parameter β = 0.5, BCAM-Net achieves high detection probabilities in the low-SNR regime and outperforms representative baselines. At a false alarm probability Pf = 0.1 and an SNR of −14 dB, it attains a detection probability of 0.9020, exceeding the CNN-Transformer, WT-ResNet, TFCFN, and conventional CNN benchmarks by 5.75%, 6.98%, 33.3%, and 21.1%, respectively. These results indicate that BCAM-Net can effectively improve spectrum sensing performance in low-SNR impulsive-noise scenarios and provides a lightweight, high-performance solution for practical cognitive radio spectrum sensing.
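For readers unfamiliar with the GGD noise model, the following minimal sketch draws zero-mean GGD samples with shape parameter β = 0.5 and scales them to a target SNR of −14 dB. It uses the standard result that (|X|/α)^β follows a Gamma(1/β, 1) distribution and that the variance is α²·Γ(3/β)/Γ(1/β). The function names, the unit signal power, and the sample count are assumptions for illustration; the paper's actual simulation setup is not given in the abstract.

```python
import math
import random


def ggd_scale(beta, sigma):
    """Scale alpha such that a zero-mean GGD with shape beta has variance sigma**2."""
    return sigma * math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))


def sample_ggd(n, beta, sigma, rng):
    """Draw n GGD samples: (|X|/alpha)**beta ~ Gamma(1/beta, 1), with a random sign."""
    alpha = ggd_scale(beta, sigma)
    samples = []
    for _ in range(n):
        g = rng.gammavariate(1.0 / beta, 1.0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        samples.append(sign * alpha * g ** (1.0 / beta))
    return samples


def noise_sigma_for_snr(signal_power, snr_db):
    """Noise standard deviation that yields the requested SNR in dB."""
    return math.sqrt(signal_power / 10 ** (snr_db / 10.0))


rng = random.Random(0)                      # fixed seed for reproducibility
sigma = noise_sigma_for_snr(1.0, -14.0)     # assumed unit signal power
noise = sample_ggd(100_000, 0.5, sigma, rng)
empirical_power = sum(x * x for x in noise) / len(noise)
print(f"target noise power={sigma ** 2:.2f}, empirical={empirical_power:.2f}")
```

With β = 2 the same scale formula reduces to α = σ√2, the Gaussian case; smaller β gives heavier tails, which is why β = 0.5 serves as an impulsive-noise setting.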
Abstract: Deep learning-based methods have shown great potential in intelligent bearing fault diagnosis. However, most existing approaches suffer from the scarcity of labeled data, which often results in insufficient robustness under complex working conditions and a general lack of interpretability. To address these challenges, we propose a physics-informed multimodal fault diagnosis framework based on few-shot learning, which integrates a 2D time-frequency image encoder and a 1D vibration signal encoder. Specifically, we embed prior knowledge of multi-resolution analysis from signal processing into the model by designing a Laplace Wavelet Convolution (LWC) module, which enhances interpretability because wavelet coefficients naturally correspond to specific frequency and temporal structures. To further balance the guidance of physical priors with the flexibility of learnable representations, we introduce a parametric multi-kernel wavelet that employs channel-wise dynamic attention to adaptively select relevant wavelet bases, thereby improving feature expressiveness. Moreover, we develop a Mahalanobis-Prototype Joint Metric, which constructs more accurate and distribution-consistent decision boundaries under few-shot conditions. Comprehensive experiments on the Case Western Reserve University (CWRU) and Paderborn University (PU) bearing datasets demonstrate the superior effectiveness, robustness, and interpretability of the proposed approach compared with state-of-the-art baselines.
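The abstract does not define the Mahalanobis-Prototype Joint Metric, so the sketch below only illustrates the general idea it builds on: classifying a query by its Mahalanobis distance to class prototypes (support-set means). As a simplifying assumption it uses a diagonal covariance pooled over all classes, with a small regularizer so the estimate stays usable with very few shots. All function names and the toy features are hypothetical.

```python
def class_prototypes(support):
    """support: dict label -> list of feature vectors. Returns label -> mean vector."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos


def pooled_diag_variance(support, protos, eps=1e-3):
    """Per-dimension variance pooled over all classes (diagonal covariance).
    eps is an assumed regularizer that avoids division by zero."""
    dim = len(next(iter(protos.values())))
    sq_sum = [0.0] * dim
    count = 0
    for label, vecs in support.items():
        for v in vecs:
            for i in range(dim):
                sq_sum[i] += (v[i] - protos[label][i]) ** 2
            count += 1
    return [s / count + eps for s in sq_sum]


def mahalanobis_sq(x, proto, var):
    """Squared Mahalanobis distance under a diagonal covariance."""
    return sum((a - b) ** 2 / s for a, b, s in zip(x, proto, var))


def classify(x, protos, var):
    """Assign x to the class whose prototype is nearest in Mahalanobis distance."""
    return min(protos, key=lambda label: mahalanobis_sq(x, protos[label], var))


# Toy 2-shot support set: dimension 0 separates the classes, dimension 1 is noisy.
support = {
    "normal": [[0.0, 0.0], [0.2, 4.0]],
    "fault": [[1.0, 0.0], [1.2, 4.0]],
}
protos = class_prototypes(support)
var = pooled_diag_variance(support, protos)
print(classify([0.9, -1.0], protos, var))
```

Dividing by the per-dimension variance down-weights the noisy second feature, so the decision is driven by the dimension that actually separates the classes; a full-covariance version would also account for correlations between features.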
Abstract: By 2025, research on Traditional Chinese Medicine (TCM) meridians has generated 12-15 macro-level theories and over 20 specific hypotheses, resulting in a highly fragmented research landscape. Objective: This paper proposes the “Holistic Hierarchical Predictive-Integration Hypothesis” (HHPIT) to construct a unified theoretical framework that integrates the rational components of existing meridian hypotheses. Methods: HHPIT is developed by systematically reviewing current meridian theories, employing interdisciplinary methodologies, integrating artificial intelligence technology, and establishing a three-tier architecture encompassing structural, functional, and systemic layers. Results: HHPIT integrates diverse meridian theories, proposes a computable algorithmic pipeline, and provides specific application protocols for chronic disease treatment, anti-aging, and enhancement of Zang-fu organ functions. Conclusion: HHPIT offers a novel, computable, and verifiable research paradigm for meridian studies, promoting the modernization and internationalization of TCM theory.