Thunderstorm wind gusts are small in scale,typically occurring within a range of a few kilometers.It is extremely challenging to monitor and forecast thunderstorm wind gusts using only automatic weather stations.There...Thunderstorm wind gusts are small in scale,typically occurring within a range of a few kilometers.It is extremely challenging to monitor and forecast thunderstorm wind gusts using only automatic weather stations.Therefore,it is necessary to establish thunderstorm wind gust identification techniques based on multisource high-resolution observations.This paper introduces a new algorithm,called thunderstorm wind gust identification network(TGNet).It leverages multimodal feature fusion to fuse the temporal and spatial features of thunderstorm wind gust events.The shapelet transform is first used to extract the temporal features of wind speeds from automatic weather stations,which is aimed at distinguishing thunderstorm wind gusts from those caused by synoptic-scale systems or typhoons.Then,the encoder,structured upon the U-shaped network(U-Net)and incorporating recurrent residual convolutional blocks(R2U-Net),is employed to extract the corresponding spatial convective characteristics of satellite,radar,and lightning observations.Finally,by using the multimodal deep fusion module based on multi-head cross-attention,the temporal features of wind speed at each automatic weather station are incorporated into the spatial features to obtain 10-minutely classification of thunderstorm wind gusts.TGNet products have high accuracy,with a critical success index reaching 0.77.Compared with those of U-Net and R2U-Net,the false alarm rate of TGNet products decreases by 31.28%and 24.15%,respectively.The new algorithm provides grid products of thunderstorm wind gusts with a spatial resolution of 0.01°,updated every 10minutes.The results are finer and more accurate,thereby helping to improve the accuracy of operational warnings for thunderstorm wind gusts.展开更多
Inverse Synthetic Aperture Radar(ISAR)images of complex targets have a low Signal-to-Noise Ratio(SNR)and contain fuzzy edges and large differences in scattering intensity,which limits the recognition performance of IS...Inverse Synthetic Aperture Radar(ISAR)images of complex targets have a low Signal-to-Noise Ratio(SNR)and contain fuzzy edges and large differences in scattering intensity,which limits the recognition performance of ISAR systems.Also,data scarcity poses a greater challenge to the accurate recognition of components.To address the issues of component recognition in complex ISAR targets,this paper adopts semantic segmentation and proposes a few-shot semantic segmentation framework fusing multimodal features.The scarcity of available data is mitigated by using a two-branch scattering feature encoding structure.Then,the high-resolution features are obtained by fusing the ISAR image texture features and scattering quantization information of complex-valued echoes,thereby achieving significantly higher structural adaptability.Meanwhile,the scattering trait enhancement module and the statistical quantification module are designed.The edge texture is enhanced based on the scatter quantization property,which alleviates the segmentation challenge of edge blurring under low SNR conditions.The coupling of query/support samples is enhanced through four-dimensional convolution.Additionally,to overcome fusion challenges caused by information differences,multimodal feature fusion is guided by equilibrium comprehension loss.In this way,the performance potential of the fusion framework is fully unleashed,and the decision risk is effectively reduced.Experiments demonstrate the great advantages of the proposed framework in multimodal feature fusion,and it still exhibits great component segmentation capability under low SNR/edge blurring conditions.展开更多
Accurate prediction of drug responses in cancer cell lines(CCLs)and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine.Despite the rapid advancements in existing...Accurate prediction of drug responses in cancer cell lines(CCLs)and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine.Despite the rapid advancements in existing computational methods for preclinical and clinical cancer drug response(CDR)prediction,challenges remain regarding the generalization of new drugs that are unseen in the training set.Herein,we propose a multimodal fusion deep learning(DL)model called drug-target and single-cell language based CDR(DTLCDR)to predict preclinical and clinical CDRs.The model integrates chemical descriptors,molecular graph representations,predicted protein target profiles of drugs,and cell line expression profiles with general knowledge from single cells.Among these features,a well-trained drug-target interaction(DTI)prediction model is used to generate target profiles of drugs,and a pretrained single-cell language model is integrated to provide general genomic knowledge.Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibited improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods.Further ablation studies verified the effectiveness of each component of our model,highlighting the significant contribution of target information to generalizability.Subsequently,the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments,demonstrating its potential for real-world applications.Moreover,DTLCDR was transferred to the clinical datasets,demonstrating satisfactory performance in the clinical data,regardless of whether the drugs were included in the cell line dataset.Overall,our results suggest that the DTLCDR is a promising tool for personalized drug discovery.展开更多
Background:Current lung cancer initial diagnosis relies on experienced doctors combining imaging and biological indicators,but uneven medical resource distribution in China leads to delayed early diagnosis,affecting p...Background:Current lung cancer initial diagnosis relies on experienced doctors combining imaging and biological indicators,but uneven medical resource distribution in China leads to delayed early diagnosis,affecting prognosis.Existing methods struggle with large‐scale screening,multitracking,and over‐reliance on single‐modality data,ignoring the potential of multisource complementary information.Key technical challenges-effective data collection,multimodal feature extraction/fusion,and AI model construction-limit clinical application.Thus,exploring AI,new sensors,and existing data for efficient,fast,accurate,and radiation‐free preliminary diagnosis is crucial for timely treatment and improved outcomes.Methods:This study collected hematological data,and used fiber‐optic vibration sensors and audio sensors to capture heterogeneous signals of patients'lung respiration.Fiber‐optic respiratory frequency,audio‐respiratory rhythm,and hematological leukocyterelated features were extracted,optimized as multimodal inputs.The SCCA‐LMF fusion method generated fusion samples,which were input into an improved stacking ensemble learning model(including SVM,XGBoost,etc.)for binary classification.Results:The experiment included 360 actual samples(lung cancer:nonlung cancer=3.6:1)with complete data of 55-65‐yearold males and females.Predictive accuracy,sensitivity,specificity,and F1 score reached 97.70%,95.75%,99.64%,and 99.64%,respectively,outperforming existing independent LMF and TFN methods.This model effectively integrates respiratory vibration,audio signals,and routine blood tests.A multimodal feature grading fusion strategy was designed for 3D data analysis to comprehensively understand patient health and enhance prediction capabilities.All data and results are reproducible.Conclusion:This study demonstrates the method's potential for lung cancer preliminary identification,bridging medicine and engineering to improve healthcare outcomes.展开更多
Hateful meme is a multimodal medium that combines images and texts.The potential hate content of hateful memes has caused serious problems for social media security.The current hateful memes classification task faces ...Hateful meme is a multimodal medium that combines images and texts.The potential hate content of hateful memes has caused serious problems for social media security.The current hateful memes classification task faces significant data scarcity challenges,and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting issues.In addition,it is a challenge to understand the underlying relationship between text and images in the hateful memes.To address these issues,we propose a multimodal hateful memes classification model named LABF,which is based on low-rank adapter layers and bidirectional gated feature fusion.Firstly,low-rank adapter layers are adopted to learn the feature representation of the new dataset.This is achieved by introducing a small number of additional parameters while retaining prior knowledge of the CLIP model,which effectively alleviates the overfitting phenomenon.Secondly,a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion.Experimental results show that the method significantly outperforms existing methods on two public datasets,verifying its effectiveness and robustness.展开更多
The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos,providing a foundation for realizing intelligent and accurate teaching.However,th...The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos,providing a foundation for realizing intelligent and accurate teaching.However,the complex nature of the classroom environment has added challenges and difficulties in the process of student action recognition.In this research article,with regard to the circumstances where students are prone to be occluded and classroom computing resources are restricted in real classroom scenarios,a lightweight multi-modal fusion action recognition approach is put forward.This proposed method is capable of enhancing the accuracy of student action recognition while concurrently diminishing the number of parameters of the model and the Computation Amount,thereby achieving a more efficient and accurate recognition performance.In the feature extraction stage,this method fuses the keypoint heatmap with the RGB(Red-Green-Blue color model)image.In order to fully utilize the unique information of different modalities for feature complementarity,a Feature Fusion Module(FFE)is introduced.The FFE encodes and fuses the unique features of the two modalities during the feature extraction process.This fusion strategy not only achieves fusion and complementarity between modalities,but also improves the overall model performance.Furthermore,to reduce the computational load and parameter scale of the model,we use keypoint information to crop RGB images.At the same time,the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features.These methods significantly reduce the computational load and parameter scale.The number of parameters of the model is 1.40 million,and the computation amount is 5.04 billion floating-point operations per second(GFLOPs),achieving an efficient lightweight design.In the Student Classroom Action Dataset(SCAD),the accuracy of the model is 88.36%.In NTU 60(Nanyang Technological University Red-Green-Blue-Depth RGB+Ddataset with 60 categories),the accuracies on X-Sub(The people in the training set are different from those in the test set)and X-View(The perspectives of the training set and the test set are different)are 95.76%and 98.82%,respectively.On the NTU 120 dataset(Nanyang Technological University Red-Green-Blue-Depth dataset with 120 categories),RGB+Dthe accuracies on X-Sub and X-Set(the perspectives of the training set and the test set are different)are 91.97%and 93.45%,respectively.The model has achieved a balance in terms of accuracy,computation amount,and the number of parameters.展开更多
Accurate estimation of lithium battery state-of-health(SOH)is essential for ensuring safe operation and efficient utilization.To address the challenges of complex degradation factors and unreliable feature extraction,...Accurate estimation of lithium battery state-of-health(SOH)is essential for ensuring safe operation and efficient utilization.To address the challenges of complex degradation factors and unreliable feature extraction,we develop a novel SOH prediction model integrating physical information constraints and multimodal feature fusion.Our approach employs a multi-channel encoder to process heterogeneous data modalities,including health indicators,raw charge/discharge sequences,and incremental capacity data,and uses multi-channel encoders to achieve structured input.A physics-informed loss function,derived from an empirical capacity decay equation,is incorporated to enforce interpretability,while a cross-layer attention mechanism dynamically weights features to handle missing modalities and random noise.Experimental validation on multiple battery types demonstrates that our model reduces mean absolute error(MAE)by at least 51.09%compared to unimodal baselines,maintains robustness under adverse conditions such as partial data loss,and achieves an average MAE of 0.0201 in real-world battery pack applications.This model significantly enhances the accuracy and universality of prediction,enabling accurate prediction of battery SOH under actual engineering conditions.展开更多
Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content.Most existing methods do not consider the multimodal nature of the video,and t...Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content.Most existing methods do not consider the multimodal nature of the video,and the modality fusion approach tends to be too simple,often neglecting modality alignment before fusion.This research introduces a novel dual stream multimodal alignment and fusion network named DMAFNet for classifying short videos.The network uses two unimodal encoder modules to extract features within modalities and exploits a multimodal encoder module to learn interaction between modalities.To solve the modality alignment problem,contrastive learning is introduced between two unimodal encoder modules.Additionally,masked language modeling(MLM)and video text matching(VTM)auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of loss functions.Diverse experiments prove the efficiency of DMAFNet in multimodal video classification tasks.Compared with other two mainstream baselines,DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.展开更多
Visual question answering(VQA)is a multimodal task,involving a deep understanding of the image scene and the question’s meaning and capturing the relevant correlations between both modalities to infer the appropriate...Visual question answering(VQA)is a multimodal task,involving a deep understanding of the image scene and the question’s meaning and capturing the relevant correlations between both modalities to infer the appropriate answer.In this paper,we propose a VQA system intended to answer yes/no questions about real-world images,in Arabic.To support a robust VQA system,we work in two directions:(1)Using deep neural networks to semantically represent the given image and question in a fine-grainedmanner,namely ResNet-152 and Gated Recurrent Units(GRU).(2)Studying the role of the utilizedmultimodal bilinear pooling fusion technique in the trade-o.between the model complexity and the overall model performance.Some fusion techniques could significantly increase the model complexity,which seriously limits their applicability for VQA models.So far,there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions.Hence,a comparative analysis is conducted between eight bilinear pooling fusion techniques,in terms of their ability to reduce themodel complexity and improve themodel performance in this case of VQA systems.Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model’s performance,until reaching the best performance of 89.25%.Further,experiments have proven that the number of answers in the developed VQA system is a critical factor that a.ects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity.The Multimodal Local Perception Bilinear Pooling(MLPB)technique has shown the best balance between the model complexity and its performance,for VQA systems designed to answer yes/no questions.展开更多
Medical image fusion technology is crucial for improving the detection accuracy and treatment efficiency of diseases,but existing fusion methods have problems such as blurred texture details,low contrast,and inability...Medical image fusion technology is crucial for improving the detection accuracy and treatment efficiency of diseases,but existing fusion methods have problems such as blurred texture details,low contrast,and inability to fully extract fused image information.Therefore,a multimodal medical image fusion method based on mask optimization and parallel attention mechanism was proposed to address the aforementioned issues.Firstly,it converted the entire image into a binary mask,and constructed a contour feature map to maximize the contour feature information of the image and a triple path network for image texture detail feature extraction and optimization.Secondly,a contrast enhancement module and a detail preservation module were proposed to enhance the overall brightness and texture details of the image.Afterwards,a parallel attention mechanism was constructed using channel features and spatial feature changes to fuse images and enhance the salient information of the fused images.Finally,a decoupling network composed of residual networks was set up to optimize the information between the fused image and the source image so as to reduce information loss in the fused image.Compared with nine high-level methods proposed in recent years,the seven objective evaluation indicators of our method have improved by 6%−31%,indicating that this method can obtain fusion results with clearer texture details,higher contrast,and smaller pixel differences between the fused image and the source image.It is superior to other comparison algorithms in both subjective and objective indicators.展开更多
The utilization of digital picture search and retrieval has grown substantially in numerous fields for different purposes during the last decade,owing to the continuing advances in image processing and computer vision...The utilization of digital picture search and retrieval has grown substantially in numerous fields for different purposes during the last decade,owing to the continuing advances in image processing and computer vision approaches.In multiple real-life applications,for example,social media,content-based face picture retrieval is a well-invested technique for large-scale databases,where there is a significant necessity for reliable retrieval capabilities enabling quick search in a vast number of pictures.Humans widely employ faces for recognizing and identifying people.Thus,face recognition through formal or personal pictures is increasingly used in various real-life applications,such as helping crime investigators retrieve matching images from face image databases to identify victims and criminals.However,such face image retrieval becomes more challenging in large-scale databases,where traditional vision-based face analysis requires ample additional storage space than the raw face images already occupied to store extracted lengthy feature vectors and takes much longer to process and match thousands of face images.This work mainly contributes to enhancing face image retrieval performance in large-scale databases using hash codes inferred by locality-sensitive hashing(LSH)for facial hard and soft biometrics as(Hard BioHash)and(Soft BioHash),respectively,to be used as a search input for retrieving the top-k matching faces.Moreover,we propose the multi-biometric score-level fusion of both face hard and soft BioHashes(Hard-Soft BioHash Fusion)for further augmented face image retrieval.The experimental outcomes applied on the Labeled Faces in the Wild(LFW)dataset and the related attributes dataset(LFW-attributes),demonstrate that the retrieval performance of the suggested fusion approach(Hard-Soft BioHash Fusion)significantly improved the retrieval performance compared to solely using Hard BioHash or Soft BioHash in isolation,where the suggested method provides an augmented accuracy of 87%when executed on 1000 specimens and 77%on 5743 samples.These results remarkably outperform the results of the Hard BioHash method by(50%on the 1000 samples and 30%on the 5743 samples),and the Soft BioHash method by(78%on the 1000 samples and 63%on the 5743 samples).展开更多
The digital twin is the concept of transcending reality,which is the reverse feedback from the real physical space to the virtual digital space.People hold great prospects for this emerging technology.In order to real...The digital twin is the concept of transcending reality,which is the reverse feedback from the real physical space to the virtual digital space.People hold great prospects for this emerging technology.In order to realize the upgrading of the digital twin industrial chain,it is urgent to introduce more modalities,such as vision,haptics,hearing and smell,into the virtual digital space,which assists physical entities and virtual objects in creating a closer connection.Therefore,perceptual understanding and object recognition have become an urgent hot topic in the digital twin.Existing surface material classification schemes often achieve recognition through machine learning or deep learning in a single modality,ignoring the complementarity between multiple modalities.In order to overcome this dilemma,we propose a multimodal fusion network in our article that combines two modalities,visual and haptic,for surface material recognition.On the one hand,the network makes full use of the potential correlations between multiple modalities to deeply mine the modal semantics and complete the data mapping.On the other hand,the network is extensible and can be used as a universal architecture to include more modalities.Experiments show that the constructed multimodal fusion network can achieve 99.42%classification accuracy while reducing complexity.展开更多
3D vehicle detection based on LiDAR-camera fusion is becoming an emerging research topic in autonomous driving.The algorithm based on the Camera-LiDAR object candidate fusion method(CLOCs)is currently considered to be...3D vehicle detection based on LiDAR-camera fusion is becoming an emerging research topic in autonomous driving.The algorithm based on the Camera-LiDAR object candidate fusion method(CLOCs)is currently considered to be a more effective decision-level fusion algorithm,but it does not fully utilize the extracted features of 3D and 2D.Therefore,we proposed a 3D vehicle detection algorithm based onmultimodal decision-level fusion.First,project the anchor point of the 3D detection bounding box into the 2D image,calculate the distance between 2D and 3D anchor points,and use this distance as a new fusion feature to enhance the feature redundancy of the network.Subsequently,add an attention module:squeeze-and-excitation networks,weight each feature channel to enhance the important features of the network,and suppress useless features.The experimental results show that the mean average precision of the algorithm in the KITTI dataset is 82.96%,which outperforms previous state-ofthe-art multimodal fusion-based methods,and the average accuracy in the Easy,Moderate and Hard evaluation indicators reaches 88.96%,82.60%,and 77.31%,respectively,which are higher compared to the original CLOCs model by 1.02%,2.29%,and 0.41%,respectively.Compared with the original CLOCs algorithm,our algorithm has higher accuracy and better performance in 3D vehicle detection.展开更多
Natural events have had a significant impact on overall flight activity,and the aviation industry plays a vital role in helping society cope with the impact of these events.As one of the most impactful weather typhoon...Natural events have had a significant impact on overall flight activity,and the aviation industry plays a vital role in helping society cope with the impact of these events.As one of the most impactful weather typhoon seasons appears and continues,airlines operating in threatened areas and passengers having travel plans during this time period will pay close attention to the development of tropical storms.This paper proposes a deep multimodal fusion and multitasking trajectory prediction model that can improve the reliability of typhoon trajectory prediction and reduce the quantity of flight scheduling cancellation.The deep multimodal fusion module is formed by deep fusion of the feature output by multiple submodal fusion modules,and the multitask generation module uses longitude and latitude as two related tasks for simultaneous prediction.With more dependable data accuracy,problems can be analysed rapidly and more efficiently,enabling better decision-making with a proactive versus reactive posture.When multiple modalities coexist,features can be extracted from them simultaneously to supplement each other’s information.An actual case study,the typhoon Lichma that swept China in 2019,has demonstrated that the algorithm can effectively reduce the number of unnecessary flight cancellations compared to existing flight scheduling and assist the new generation of flight scheduling systems under extreme weather.展开更多
In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection ...In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection will be affected by problems such as illumination changes,object occlusion,and object detection distance.To this purpose,we face these challenges by proposing a multimodal feature fusion network for 3D object detection(MFF-Net).In this research,this paper first uses the spatial transformation projection algorithm to map the image features into the feature space,so that the image features are in the same spatial dimension when fused with the point cloud features.Then,feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features,suppress useless features,and increase the directionality of the network to features.Finally,this paper increases the probability of false detection and missed detection in the non-maximum suppression algo-rithm by increasing the one-dimensional threshold.So far,this paper has constructed a complete 3D target detection network based on multimodal feature fusion.The experimental results show that the proposed achieves an average accuracy of 82.60%on the Karlsruhe Institute of Technology and Toyota Technological Institute(KITTI)dataset,outperforming previous state-of-the-art multimodal fusion networks.In Easy,Moderate,and hard evaluation indicators,the accuracy rate of this paper reaches 90.96%,81.46%,and 75.39%.This shows that the MFF-Net network has good performance in 3D object detection.展开更多
This paper presents a low intricate,profoundly energy effective MRI Images combination intended for remote visual sensor frameworks which leads to improved understanding and implementation of treatment;especially for ...This paper presents a low intricate,profoundly energy effective MRI Images combination intended for remote visual sensor frameworks which leads to improved understanding and implementation of treatment;especially for radiology.This is done by combining the original picture which leads to a significant reduction in the computation time and frequency.The proposed technique conquers the calculation and energy impediment of low power tools and is examined as far as picture quality and energy is concerned.Reenactments are performed utilizing MATLAB 2018a,to quantify the resultant vitality investment funds and the reproduction results show that the proposed calculation is very quick and devours just around 1%of vitality decomposition by the hybrid combination plans.Likewise,the effortlessness of our proposed strategy makes it increasingly suitable for continuous applications.展开更多
Recent advances in computer vision and deep learning have shown that the fusion of depth information can significantly enhance the performance of RGB-based damage detection and segmentation models.However,alongside th...Recent advances in computer vision and deep learning have shown that the fusion of depth information can significantly enhance the performance of RGB-based damage detection and segmentation models.However,alongside the advantages,depth-sensing also presents many practical challenges.For instance,the depth sensors impose an additional payload burden on the robotic inspection platforms limiting the operation time and increasing the inspection cost.Additionally,some lidar-based depth sensors have poor outdoor performance due to sunlight contamination during the daytime.In this context,this study investigates the feasibility of abolishing depth-sensing at test time without compromising the segmentation performance.An autonomous damage segmentation framework is developed,based on recent advancements in vision-based multi-modal sensing such as modality hallucination(MH)and monocular depth estimation(MDE),which require depth data only during the model training.At the time of deployment,depth data becomes expendable as it can be simulated from the corresponding RGB frames.This makes it possible to reap the benefits of depth fusion without any depth perception per se.This study explored two different depth encoding techniques and three different fusion strategies in addition to a baseline RGB-based model.The proposed approach is validated on computer-generated RGB-D data of reinforced concrete buildings subjected to seismic damage.It was observed that the surrogate techniques can increase the segmentation IoU by up to 20.1%with a negligible increase in the computation cost.Overall,this study is believed to make a positive contribution to enhancing the resilience of critical civil infrastructure.展开更多
Weather radar echo extrapolation plays a crucial role in weather forecasting.However,traditional weather radar echo extrapolation methods are not very accurate and do not make full use of historical data.Deep learning...Weather radar echo extrapolation plays a crucial role in weather forecasting.However,traditional weather radar echo extrapolation methods are not very accurate and do not make full use of historical data.Deep learning algorithms based on Recurrent Neural Networks also have the problem of accumulating errors.Moreover,it is difficult to obtain higher accuracy by relying on a single historical radar echo observation.Therefore,in this study,we constructed the Fusion GRU module,which leverages a cascade structure to effectively combine radar echo data and mean wind data.We also designed the Top Connection so that the model can capture the global spatial relationship to construct constraints on the predictions.Based on the Jiangsu Province dataset,we compared some models.The results show that our proposed model,Cascade Fusion Spatiotemporal Network(CFSN),improved the critical success index(CSI)by 10.7%over the baseline at the threshold of 30 dBZ.Ablation experiments further validated the effectiveness of our model.Similarly,the CSI of the complete CFSN was 0.004 higher than the suboptimal solution without the cross-attention module at the threshold of 30 dBZ.展开更多
In this paper,we propose a new image fusion algorithm based on two-dimensional Scale-Mixing Complex Wavelet Transform(2D-SMCWT).The fusion of the detail 2D-SMCWT cofficients is performed via a Bayesian Maximum a Poste...In this paper,we propose a new image fusion algorithm based on two-dimensional Scale-Mixing Complex Wavelet Transform(2D-SMCWT).The fusion of the detail 2D-SMCWT cofficients is performed via a Bayesian Maximum a Posteriori(MAP)approach by considering a trivariate statistical model for the local neighboring of 2D-SMCWT coefficients.For the approx imation coefficients,a new fusion rule based on the Principal Component Analysis(PCA)is applied.We conduct several experiments using three different groups of multimodal medical images to evaluate the performance of the proposed method.The obt ained results prove the superiority of the proposed method over the state of the art fusion methods in terms of visual quality and several commonly used metrics.Robustness of the proposed method is further tested against different types of noise.The plots of fusion met rics establish the accuracy of the proposed fusion method.展开更多
基金supported by the National Key Research and Development Program of China(Grant No.2022YFC3004104)the National Natural Science Foundation of China(Grant No.U2342204)+4 种基金the Innovation and Development Program of the China Meteorological Administration(Grant No.CXFZ2024J001)the Open Research Project of the Key Open Laboratory of Hydrology and Meteorology of the China Meteorological Administration(Grant No.23SWQXZ010)the Science and Technology Plan Project of Zhejiang Province(Grant No.2022C03150)the Open Research Fund Project of Anyang National Climate Observatory(Grant No.AYNCOF202401)the Open Bidding for Selecting the Best Candidates Program(Grant No.CMAJBGS202318)。
文摘Thunderstorm wind gusts are small in scale,typically occurring within a range of a few kilometers.It is extremely challenging to monitor and forecast thunderstorm wind gusts using only automatic weather stations.Therefore,it is necessary to establish thunderstorm wind gust identification techniques based on multisource high-resolution observations.This paper introduces a new algorithm,called thunderstorm wind gust identification network(TGNet).It leverages multimodal feature fusion to fuse the temporal and spatial features of thunderstorm wind gust events.The shapelet transform is first used to extract the temporal features of wind speeds from automatic weather stations,which is aimed at distinguishing thunderstorm wind gusts from those caused by synoptic-scale systems or typhoons.Then,the encoder,structured upon the U-shaped network(U-Net)and incorporating recurrent residual convolutional blocks(R2U-Net),is employed to extract the corresponding spatial convective characteristics of satellite,radar,and lightning observations.Finally,by using the multimodal deep fusion module based on multi-head cross-attention,the temporal features of wind speed at each automatic weather station are incorporated into the spatial features to obtain 10-minutely classification of thunderstorm wind gusts.TGNet products have high accuracy,with a critical success index reaching 0.77.Compared with those of U-Net and R2U-Net,the false alarm rate of TGNet products decreases by 31.28%and 24.15%,respectively.The new algorithm provides grid products of thunderstorm wind gusts with a spatial resolution of 0.01°,updated every 10minutes.The results are finer and more accurate,thereby helping to improve the accuracy of operational warnings for thunderstorm wind gusts.
文摘Inverse Synthetic Aperture Radar(ISAR)images of complex targets have a low Signal-to-Noise Ratio(SNR)and contain fuzzy edges and large differences in scattering intensity,which limits the recognition performance of ISAR systems.Also,data scarcity poses a greater challenge to the accurate recognition of components.To address the issues of component recognition in complex ISAR targets,this paper adopts semantic segmentation and proposes a few-shot semantic segmentation framework fusing multimodal features.The scarcity of available data is mitigated by using a two-branch scattering feature encoding structure.Then,the high-resolution features are obtained by fusing the ISAR image texture features and scattering quantization information of complex-valued echoes,thereby achieving significantly higher structural adaptability.Meanwhile,the scattering trait enhancement module and the statistical quantification module are designed.The edge texture is enhanced based on the scatter quantization property,which alleviates the segmentation challenge of edge blurring under low SNR conditions.The coupling of query/support samples is enhanced through four-dimensional convolution.Additionally,to overcome fusion challenges caused by information differences,multimodal feature fusion is guided by equilibrium comprehension loss.In this way,the performance potential of the fusion framework is fully unleashed,and the decision risk is effectively reduced.Experiments demonstrate the great advantages of the proposed framework in multimodal feature fusion,and it still exhibits great component segmentation capability under low SNR/edge blurring conditions.
基金supported by the National Key Research and Development Program of China(Grant No.:2023YFC2605002)the National Key R&D Program of China(Grant No.:2022YFF1203003)+2 种基金Beijing AI Health Cultivation Project,China(Grant No.:Z221100003522022)the National Natural Science Foundation of China(Grant No.:82273772)the Beijing Natural Science Foundation,China(Grant No.:7212152).
文摘Accurate prediction of drug responses in cancer cell lines(CCLs)and transferable prediction of clinical drug responses using CCLs are two major tasks in personalized medicine.Despite the rapid advancements in existing computational methods for preclinical and clinical cancer drug response(CDR)prediction,challenges remain regarding the generalization of new drugs that are unseen in the training set.Herein,we propose a multimodal fusion deep learning(DL)model called drug-target and single-cell language based CDR(DTLCDR)to predict preclinical and clinical CDRs.The model integrates chemical descriptors,molecular graph representations,predicted protein target profiles of drugs,and cell line expression profiles with general knowledge from single cells.Among these features,a well-trained drug-target interaction(DTI)prediction model is used to generate target profiles of drugs,and a pretrained single-cell language model is integrated to provide general genomic knowledge.Comparison experiments on the cell line drug sensitivity dataset demonstrated that DTLCDR exhibited improved generalizability and robustness in predicting unseen drugs compared with previous state-of-the-art baseline methods.Further ablation studies verified the effectiveness of each component of our model,highlighting the significant contribution of target information to generalizability.Subsequently,the ability of DTLCDR to predict novel molecules was validated through in vitro cell experiments,demonstrating its potential for real-world applications.Moreover,DTLCDR was transferred to the clinical datasets,demonstrating satisfactory performance in the clinical data,regardless of whether the drugs were included in the cell line dataset.Overall,our results suggest that the DTLCDR is a promising tool for personalized drug discovery.
基金the Natural Science Foundation of Gansu Province(No.20JR10RA614,22YF7GA182,22JR11RA042,22JR5RA1006)the National Natural Science Foundation o Gansu Province(No.24CXGA024)+3 种基金the Industrial Support Plan for Higher Education Institutions in Gansu Province(No.CYZC-2024-10)the Open Fund of Key Laboratory of Time and Frequency Primary Standards,CAS,the Gansu Provincial University Industry Support Plan Project(2022CYZC-072022)the Lanzhou Chengguan District Science and Technology Plan Project(2021RCCX0031)Lanzhou Science and Technology Program(No.2024-4-38).
文摘Background:Current lung cancer initial diagnosis relies on experienced doctors combining imaging and biological indicators,but uneven medical resource distribution in China leads to delayed early diagnosis,affecting prognosis.Existing methods struggle with large‐scale screening,multitracking,and over‐reliance on single‐modality data,ignoring the potential of multisource complementary information.Key technical challenges-effective data collection,multimodal feature extraction/fusion,and AI model construction-limit clinical application.Thus,exploring AI,new sensors,and existing data for efficient,fast,accurate,and radiation‐free preliminary diagnosis is crucial for timely treatment and improved outcomes.Methods:This study collected hematological data,and used fiber‐optic vibration sensors and audio sensors to capture heterogeneous signals of patients'lung respiration.Fiber‐optic respiratory frequency,audio‐respiratory rhythm,and hematological leukocyterelated features were extracted,optimized as multimodal inputs.The SCCA‐LMF fusion method generated fusion samples,which were input into an improved stacking ensemble learning model(including SVM,XGBoost,etc.)for binary classification.Results:The experiment included 360 actual samples(lung cancer:nonlung cancer=3.6:1)with complete data of 55-65‐yearold males and females.Predictive accuracy,sensitivity,specificity,and F1 score reached 97.70%,95.75%,99.64%,and 99.64%,respectively,outperforming existing independent LMF and TFN methods.This model effectively integrates respiratory vibration,audio signals,and routine blood tests.A multimodal feature grading fusion strategy was designed for 3D data analysis to comprehensively understand patient health and enhance prediction capabilities.All data and results are reproducible.Conclusion:This study demonstrates the method's potential for lung cancer preliminary identification,bridging medicine and engineering to improve healthcare outcomes.
基金supported by the Funding for Research on the Evolution of Cyberbullying Incidents and Intervention Strategies(24BSH033)Discipline Innovation and Talent Introduction Bases in Higher Education Institutions(B20087).
文摘Hateful meme is a multimodal medium that combines images and texts.The potential hate content of hateful memes has caused serious problems for social media security.The current hateful memes classification task faces significant data scarcity challenges,and direct fine-tuning of large-scale pre-trained models often leads to severe overfitting issues.In addition,it is a challenge to understand the underlying relationship between text and images in the hateful memes.To address these issues,we propose a multimodal hateful memes classification model named LABF,which is based on low-rank adapter layers and bidirectional gated feature fusion.Firstly,low-rank adapter layers are adopted to learn the feature representation of the new dataset.This is achieved by introducing a small number of additional parameters while retaining prior knowledge of the CLIP model,which effectively alleviates the overfitting phenomenon.Secondly,a bidirectional gated feature fusion mechanism is designed to dynamically adjust the interaction weights of text and image features to achieve finer cross-modal fusion.Experimental results show that the method significantly outperforms existing methods on two public datasets,verifying its effectiveness and robustness.
基金supported by the National Natural Science Foundation of China under Grant 62107034the Major Science and Technology Project of Yunnan Province(202402AD080002)Yunnan International Joint R&D Center of China-Laos-Thailand Educational Digitalization(202203AP140006).
文摘The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos,providing a foundation for realizing intelligent and accurate teaching.However,the complex nature of the classroom environment has added challenges and difficulties in the process of student action recognition.In this research article,with regard to the circumstances where students are prone to be occluded and classroom computing resources are restricted in real classroom scenarios,a lightweight multi-modal fusion action recognition approach is put forward.This proposed method is capable of enhancing the accuracy of student action recognition while concurrently diminishing the number of parameters of the model and the Computation Amount,thereby achieving a more efficient and accurate recognition performance.In the feature extraction stage,this method fuses the keypoint heatmap with the RGB(Red-Green-Blue color model)image.In order to fully utilize the unique information of different modalities for feature complementarity,a Feature Fusion Module(FFE)is introduced.The FFE encodes and fuses the unique features of the two modalities during the feature extraction process.This fusion strategy not only achieves fusion and complementarity between modalities,but also improves the overall model performance.Furthermore,to reduce the computational load and parameter scale of the model,we use keypoint information to crop RGB images.At the same time,the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features.These methods significantly reduce the computational load and parameter scale.The number of parameters of the model is 1.40 million,and the computation amount is 5.04 billion floating-point operations per second(GFLOPs),achieving an efficient lightweight design.In the Student Classroom Action Dataset(SCAD),the accuracy of the model is 88.36%.In NTU 60(Nanyang Technological University Red-Green-Blue-Depth RGB+Ddataset with 60 categories),the accuracies on X-Sub(The people in the training set are different from those in the test set)and X-View(The perspectives of the training set and the test set are different)are 95.76%and 98.82%,respectively.On the NTU 120 dataset(Nanyang Technological University Red-Green-Blue-Depth dataset with 120 categories),RGB+Dthe accuracies on X-Sub and X-Set(the perspectives of the training set and the test set are different)are 91.97%and 93.45%,respectively.The model has achieved a balance in terms of accuracy,computation amount,and the number of parameters.
基金Project(2023YFB2303704-07)supported by the National Natural Science Foundation of China。
文摘Accurate estimation of lithium battery state-of-health(SOH)is essential for ensuring safe operation and efficient utilization.To address the challenges of complex degradation factors and unreliable feature extraction,we develop a novel SOH prediction model integrating physical information constraints and multimodal feature fusion.Our approach employs a multi-channel encoder to process heterogeneous data modalities,including health indicators,raw charge/discharge sequences,and incremental capacity data,and uses multi-channel encoders to achieve structured input.A physics-informed loss function,derived from an empirical capacity decay equation,is incorporated to enforce interpretability,while a cross-layer attention mechanism dynamically weights features to handle missing modalities and random noise.Experimental validation on multiple battery types demonstrates that our model reduces mean absolute error(MAE)by at least 51.09%compared to unimodal baselines,maintains robustness under adverse conditions such as partial data loss,and achieves an average MAE of 0.0201 in real-world battery pack applications.This model significantly enhances the accuracy and universality of prediction,enabling accurate prediction of battery SOH under actual engineering conditions.
基金Fundamental Research Funds for the Central Universities,China(No.2232021A-10)National Natural Science Foundation of China(No.61903078)+1 种基金Shanghai Sailing Program,China(No.22YF1401300)Natural Science Foundation of Shanghai,China(No.20ZR1400400)。
文摘Video classification is an important task in video understanding and plays a pivotal role in intelligent monitoring of information content.Most existing methods do not consider the multimodal nature of the video,and the modality fusion approach tends to be too simple,often neglecting modality alignment before fusion.This research introduces a novel dual stream multimodal alignment and fusion network named DMAFNet for classifying short videos.The network uses two unimodal encoder modules to extract features within modalities and exploits a multimodal encoder module to learn interaction between modalities.To solve the modality alignment problem,contrastive learning is introduced between two unimodal encoder modules.Additionally,masked language modeling(MLM)and video text matching(VTM)auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of loss functions.Diverse experiments prove the efficiency of DMAFNet in multimodal video classification tasks.Compared with other two mainstream baselines,DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
文摘Visual question answering(VQA)is a multimodal task,involving a deep understanding of the image scene and the question’s meaning and capturing the relevant correlations between both modalities to infer the appropriate answer.In this paper,we propose a VQA system intended to answer yes/no questions about real-world images,in Arabic.To support a robust VQA system,we work in two directions:(1)Using deep neural networks to semantically represent the given image and question in a fine-grainedmanner,namely ResNet-152 and Gated Recurrent Units(GRU).(2)Studying the role of the utilizedmultimodal bilinear pooling fusion technique in the trade-o.between the model complexity and the overall model performance.Some fusion techniques could significantly increase the model complexity,which seriously limits their applicability for VQA models.So far,there is no evidence of how efficient these multimodal bilinear pooling fusion techniques are for VQA systems dedicated to yes/no questions.Hence,a comparative analysis is conducted between eight bilinear pooling fusion techniques,in terms of their ability to reduce themodel complexity and improve themodel performance in this case of VQA systems.Experiments indicate that these multimodal bilinear pooling fusion techniques have improved the VQA model’s performance,until reaching the best performance of 89.25%.Further,experiments have proven that the number of answers in the developed VQA system is a critical factor that a.ects the effectiveness of these multimodal bilinear pooling techniques in achieving their main objective of reducing the model complexity.The Multimodal Local Perception Bilinear Pooling(MLPB)technique has shown the best balance between the model complexity and its performance,for VQA systems designed to answer yes/no questions.
基金supported by Gansu Natural Science Foundation Programme(No.24JRRA231)National Natural Science Foundation of China(No.62061023)Gansu Provincial Education,Science and Technology Innovation and Industry(No.2021CYZC-04)。
文摘Medical image fusion technology is crucial for improving the detection accuracy and treatment efficiency of diseases,but existing fusion methods have problems such as blurred texture details,low contrast,and inability to fully extract fused image information.Therefore,a multimodal medical image fusion method based on mask optimization and parallel attention mechanism was proposed to address the aforementioned issues.Firstly,it converted the entire image into a binary mask,and constructed a contour feature map to maximize the contour feature information of the image and a triple path network for image texture detail feature extraction and optimization.Secondly,a contrast enhancement module and a detail preservation module were proposed to enhance the overall brightness and texture details of the image.Afterwards,a parallel attention mechanism was constructed using channel features and spatial feature changes to fuse images and enhance the salient information of the fused images.Finally,a decoupling network composed of residual networks was set up to optimize the information between the fused image and the source image so as to reduce information loss in the fused image.Compared with nine high-level methods proposed in recent years,the seven objective evaluation indicators of our method have improved by 6%−31%,indicating that this method can obtain fusion results with clearer texture details,higher contrast,and smaller pixel differences between the fused image and the source image.It is superior to other comparison algorithms in both subjective and objective indicators.
基金supported and funded by KAU Scientific Endowment,King Abdulaziz University,Jeddah,Saudi Arabia,grant number 077416-04.
文摘The utilization of digital picture search and retrieval has grown substantially in numerous fields for different purposes during the last decade,owing to the continuing advances in image processing and computer vision approaches.In multiple real-life applications,for example,social media,content-based face picture retrieval is a well-invested technique for large-scale databases,where there is a significant necessity for reliable retrieval capabilities enabling quick search in a vast number of pictures.Humans widely employ faces for recognizing and identifying people.Thus,face recognition through formal or personal pictures is increasingly used in various real-life applications,such as helping crime investigators retrieve matching images from face image databases to identify victims and criminals.However,such face image retrieval becomes more challenging in large-scale databases,where traditional vision-based face analysis requires ample additional storage space than the raw face images already occupied to store extracted lengthy feature vectors and takes much longer to process and match thousands of face images.This work mainly contributes to enhancing face image retrieval performance in large-scale databases using hash codes inferred by locality-sensitive hashing(LSH)for facial hard and soft biometrics as(Hard BioHash)and(Soft BioHash),respectively,to be used as a search input for retrieving the top-k matching faces.Moreover,we propose the multi-biometric score-level fusion of both face hard and soft BioHashes(Hard-Soft BioHash Fusion)for further augmented face image retrieval.The experimental outcomes applied on the Labeled Faces in the Wild(LFW)dataset and the related attributes dataset(LFW-attributes),demonstrate that the retrieval performance of the suggested fusion approach(Hard-Soft BioHash Fusion)significantly improved the retrieval performance compared to solely using Hard BioHash or Soft BioHash in isolation,where the suggested method provides an augmented accuracy of 87%when executed on 1000 specimens and 77%on 5743 samples.These results remarkably outperform the results of the Hard BioHash method by(50%on the 1000 samples and 30%on the 5743 samples),and the Soft BioHash method by(78%on the 1000 samples and 63%on the 5743 samples).
基金the National Natural Science Foundation of China(62001246,62001248,62171232)Key R&D Program of Jiangsu Province Key project and topics under Grant BE2021095+3 种基金the Natural Science Foundation of Jiangsu Province Higher Education Institutions(20KJB510020)the Future Network Scientific Research Fund Project(FNSRFP-2021-YB-16)the open research fund of Key Lab of Broadband Wireless Communication and Sensor Network Technology(JZNY202110)the NUPTSF under Grant(NY220070).
文摘The digital twin is the concept of transcending reality,which is the reverse feedback from the real physical space to the virtual digital space.People hold great prospects for this emerging technology.In order to realize the upgrading of the digital twin industrial chain,it is urgent to introduce more modalities,such as vision,haptics,hearing and smell,into the virtual digital space,which assists physical entities and virtual objects in creating a closer connection.Therefore,perceptual understanding and object recognition have become an urgent hot topic in the digital twin.Existing surface material classification schemes often achieve recognition through machine learning or deep learning in a single modality,ignoring the complementarity between multiple modalities.In order to overcome this dilemma,we propose a multimodal fusion network in our article that combines two modalities,visual and haptic,for surface material recognition.On the one hand,the network makes full use of the potential correlations between multiple modalities to deeply mine the modal semantics and complete the data mapping.On the other hand,the network is extensible and can be used as a universal architecture to include more modalities.Experiments show that the constructed multimodal fusion network can achieve 99.42%classification accuracy while reducing complexity.
基金supported by the Financial Support of the Key Research and Development Projects of Anhui (202104a05020003)the Natural Science Foundation of Anhui Province (2208085MF173)the Anhui Development and Reform Commission Supports R&D and Innovation Projects ([2020]479).
文摘3D vehicle detection based on LiDAR-camera fusion is becoming an emerging research topic in autonomous driving.The algorithm based on the Camera-LiDAR object candidate fusion method(CLOCs)is currently considered to be a more effective decision-level fusion algorithm,but it does not fully utilize the extracted features of 3D and 2D.Therefore,we proposed a 3D vehicle detection algorithm based onmultimodal decision-level fusion.First,project the anchor point of the 3D detection bounding box into the 2D image,calculate the distance between 2D and 3D anchor points,and use this distance as a new fusion feature to enhance the feature redundancy of the network.Subsequently,add an attention module:squeeze-and-excitation networks,weight each feature channel to enhance the important features of the network,and suppress useless features.The experimental results show that the mean average precision of the algorithm in the KITTI dataset is 82.96%,which outperforms previous state-ofthe-art multimodal fusion-based methods,and the average accuracy in the Easy,Moderate and Hard evaluation indicators reaches 88.96%,82.60%,and 77.31%,respectively,which are higher compared to the original CLOCs model by 1.02%,2.29%,and 0.41%,respectively.Compared with the original CLOCs algorithm,our algorithm has higher accuracy and better performance in 3D vehicle detection.
基金supported by the National Natural Science Foundation of China(62073330)。
文摘Natural events have had a significant impact on overall flight activity,and the aviation industry plays a vital role in helping society cope with the impact of these events.As one of the most impactful weather typhoon seasons appears and continues,airlines operating in threatened areas and passengers having travel plans during this time period will pay close attention to the development of tropical storms.This paper proposes a deep multimodal fusion and multitasking trajectory prediction model that can improve the reliability of typhoon trajectory prediction and reduce the quantity of flight scheduling cancellation.The deep multimodal fusion module is formed by deep fusion of the feature output by multiple submodal fusion modules,and the multitask generation module uses longitude and latitude as two related tasks for simultaneous prediction.With more dependable data accuracy,problems can be analysed rapidly and more efficiently,enabling better decision-making with a proactive versus reactive posture.When multiple modalities coexist,features can be extracted from them simultaneously to supplement each other’s information.An actual case study,the typhoon Lichma that swept China in 2019,has demonstrated that the algorithm can effectively reduce the number of unnecessary flight cancellations compared to existing flight scheduling and assist the new generation of flight scheduling systems under extreme weather.
基金The authors would like to thank the financial support of Natural Science Foundation of Anhui Province(No.2208085MF173)the key research and development projects of Anhui(202104a05020003)+2 种基金the anhui development and reform commission supports R&D and innovation project([2020]479)the national natural science foundation of China(51575001)Anhui university scientific research platform innovation team building project(2016-2018).
文摘In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection will be affected by problems such as illumination changes,object occlusion,and object detection distance.To this purpose,we face these challenges by proposing a multimodal feature fusion network for 3D object detection(MFF-Net).In this research,this paper first uses the spatial transformation projection algorithm to map the image features into the feature space,so that the image features are in the same spatial dimension when fused with the point cloud features.Then,feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features,suppress useless features,and increase the directionality of the network to features.Finally,this paper increases the probability of false detection and missed detection in the non-maximum suppression algo-rithm by increasing the one-dimensional threshold.So far,this paper has constructed a complete 3D target detection network based on multimodal feature fusion.The experimental results show that the proposed achieves an average accuracy of 82.60%on the Karlsruhe Institute of Technology and Toyota Technological Institute(KITTI)dataset,outperforming previous state-of-the-art multimodal fusion networks.In Easy,Moderate,and hard evaluation indicators,the accuracy rate of this paper reaches 90.96%,81.46%,and 75.39%.This shows that the MFF-Net network has good performance in 3D object detection.
文摘This paper presents a low intricate,profoundly energy effective MRI Images combination intended for remote visual sensor frameworks which leads to improved understanding and implementation of treatment;especially for radiology.This is done by combining the original picture which leads to a significant reduction in the computation time and frequency.The proposed technique conquers the calculation and energy impediment of low power tools and is examined as far as picture quality and energy is concerned.Reenactments are performed utilizing MATLAB 2018a,to quantify the resultant vitality investment funds and the reproduction results show that the proposed calculation is very quick and devours just around 1%of vitality decomposition by the hybrid combination plans.Likewise,the effortlessness of our proposed strategy makes it increasingly suitable for continuous applications.
基金supported in part by a fund from Bentley Systems,Inc.
文摘Recent advances in computer vision and deep learning have shown that the fusion of depth information can significantly enhance the performance of RGB-based damage detection and segmentation models.However,alongside the advantages,depth-sensing also presents many practical challenges.For instance,the depth sensors impose an additional payload burden on the robotic inspection platforms limiting the operation time and increasing the inspection cost.Additionally,some lidar-based depth sensors have poor outdoor performance due to sunlight contamination during the daytime.In this context,this study investigates the feasibility of abolishing depth-sensing at test time without compromising the segmentation performance.An autonomous damage segmentation framework is developed,based on recent advancements in vision-based multi-modal sensing such as modality hallucination(MH)and monocular depth estimation(MDE),which require depth data only during the model training.At the time of deployment,depth data becomes expendable as it can be simulated from the corresponding RGB frames.This makes it possible to reap the benefits of depth fusion without any depth perception per se.This study explored two different depth encoding techniques and three different fusion strategies in addition to a baseline RGB-based model.The proposed approach is validated on computer-generated RGB-D data of reinforced concrete buildings subjected to seismic damage.It was observed that the surrogate techniques can increase the segmentation IoU by up to 20.1%with a negligible increase in the computation cost.Overall,this study is believed to make a positive contribution to enhancing the resilience of critical civil infrastructure.
基金National Natural Science Foundation of China(42375145)The Open Grants of China Meteorological Admin-istration Radar Meteorology Key Laboratory(2023LRM-A02)。
文摘Weather radar echo extrapolation plays a crucial role in weather forecasting.However,traditional weather radar echo extrapolation methods are not very accurate and do not make full use of historical data.Deep learning algorithms based on Recurrent Neural Networks also have the problem of accumulating errors.Moreover,it is difficult to obtain higher accuracy by relying on a single historical radar echo observation.Therefore,in this study,we constructed the Fusion GRU module,which leverages a cascade structure to effectively combine radar echo data and mean wind data.We also designed the Top Connection so that the model can capture the global spatial relationship to construct constraints on the predictions.Based on the Jiangsu Province dataset,we compared some models.The results show that our proposed model,Cascade Fusion Spatiotemporal Network(CFSN),improved the critical success index(CSI)by 10.7%over the baseline at the threshold of 30 dBZ.Ablation experiments further validated the effectiveness of our model.Similarly,the CSI of the complete CFSN was 0.004 higher than the suboptimal solution without the cross-attention module at the threshold of 30 dBZ.
文摘In this paper,we propose a new image fusion algorithm based on two-dimensional Scale-Mixing Complex Wavelet Transform(2D-SMCWT).The fusion of the detail 2D-SMCWT cofficients is performed via a Bayesian Maximum a Posteriori(MAP)approach by considering a trivariate statistical model for the local neighboring of 2D-SMCWT coefficients.For the approx imation coefficients,a new fusion rule based on the Principal Component Analysis(PCA)is applied.We conduct several experiments using three different groups of multimodal medical images to evaluate the performance of the proposed method.The obt ained results prove the superiority of the proposed method over the state of the art fusion methods in terms of visual quality and several commonly used metrics.Robustness of the proposed method is further tested against different types of noise.The plots of fusion met rics establish the accuracy of the proposed fusion method.