Gesture recognition is used in many practical applications such as human-robot interaction, medical rehabilitation, and sign language recognition. With the development of motion sensors, multiple data sources have become available, leading to the rise of multi-modal gesture recognition. Since our previous approach to gesture recognition depends on a unimodal system, it has difficulty classifying similar motion patterns. To solve this problem, a novel approach that integrates motion, audio, and video models is proposed, using a dataset captured by Kinect. The proposed system recognizes observed gestures with the three models, whose recognition results are integrated by the proposed framework to produce the final result. The motion and audio models are learned with Hidden Markov Models; Random Forest, the video classifier, is used to learn the video model. In experiments testing the performance of the proposed system, the motion and audio models most suitable for gesture recognition are chosen by varying the feature vectors and learning methods. Additionally, the unimodal and multi-modal models are compared with respect to recognition accuracy. All experiments are conducted on the dataset provided by the organizer of MMGRC, a workshop for the Multi-Modal Gesture Recognition Challenge. The comparison shows that the multi-modal model composed of the three models scores the highest recognition rate, indicating that the complementary relationship among the three models improves the accuracy of gesture recognition. The proposed system provides application technology for understanding everyday human actions more precisely.
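To make the late-fusion scheme concrete, here is a minimal sketch of score-level integration of per-class HMMs (motion and audio) with a Random Forest video classifier. It assumes hmmlearn/scikit-learn APIs, equal fusion weights, and labels 0..N-1; none of these choices are specified by the paper.

```python
# A minimal late-fusion sketch, assuming per-class GaussianHMMs for motion
# and audio and a Random Forest over fixed-length video descriptors. Shapes,
# class counts, and the uniform fusion weights are illustrative.
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.ensemble import RandomForestClassifier

N_CLASSES = 20

def train_hmms(sequences_per_class, n_states=5):
    """Fit one HMM per gesture class on that class's training sequences."""
    models = []
    for seqs in sequences_per_class:           # list of (T_i, D) arrays
        X = np.concatenate(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag")
        m.fit(X, lengths)
        models.append(m)
    return models

def hmm_scores(models, seq):
    """Per-class log-likelihoods, normalized to a probability-like vector."""
    ll = np.array([m.score(seq) for m in models])
    ll -= ll.max()                              # stabilize the exponentiation
    p = np.exp(ll)
    return p / p.sum()

def fuse(motion_models, audio_models, video_rf,
         motion_seq, audio_seq, video_feat, w=(1/3, 1/3, 1/3)):
    """Weighted sum of the three per-class score vectors; argmax decides."""
    # Assumes video_rf was trained on labels 0..N_CLASSES-1 so that
    # predict_proba columns align with the HMM class ordering.
    p = (w[0] * hmm_scores(motion_models, motion_seq)
         + w[1] * hmm_scores(audio_models, audio_seq)
         + w[2] * video_rf.predict_proba(video_feat[None, :])[0])
    return int(np.argmax(p))
```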
Background: Gesture recognition has attracted significant attention because of its wide range of potential applications. Although multi-modal gesture recognition has made significant progress in recent years, a popular method is still to simply fuse prediction scores at the end of each branch, which often ignores complementary features among different modalities in the early stage and does not fuse them into a more discriminative feature. Methods: This paper proposes an Adaptive Cross-modal Weighting (ACmW) scheme to exploit complementary features from RGB-D data. The scheme learns relations among different modalities by combining the features of different data streams. The proposed ACmW module contains two key functions: (1) fusing complementary features from multiple streams through an adaptive one-dimensional convolution; and (2) modeling the correlation of multi-stream complementary features in the time dimension. Through the effective combination of these two functional modules, ACmW can automatically analyze the relationship between the complementary features from different streams and fuse them in the spatial and temporal dimensions. Results: Extensive experiments validate the effectiveness of the proposed method and show that it outperforms state-of-the-art methods on IsoGD and NVGesture.
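As an illustration of the adaptive one-dimensional convolution fusion described above, here is a minimal PyTorch sketch; the channel counts, kernel size, and softmax weighting are assumptions for illustration rather than the paper's exact ACmW configuration.

```python
# A minimal sketch of adaptive 1-D convolutional weighting over per-stream
# temporal feature maps; all dimensions are illustrative.
import torch
import torch.nn as nn

class AdaptiveCrossModalWeighting(nn.Module):
    def __init__(self, channels: int, n_streams: int):
        super().__init__()
        # A 1-D conv over time produces one weight map per stream.
        self.weight_conv = nn.Conv1d(n_streams * channels, n_streams,
                                     kernel_size=3, padding=1)
        # Temporal modeling of the fused complementary features.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, streams):                # list of (B, C, T) tensors
        stacked = torch.cat(streams, dim=1)    # (B, S*C, T)
        w = torch.softmax(self.weight_conv(stacked), dim=1)  # (B, S, T)
        fused = sum(w[:, i:i + 1, :] * s for i, s in enumerate(streams))
        return self.temporal(fused)            # (B, C, T)

# Usage sketch: fused = AdaptiveCrossModalWeighting(256, 2)([rgb_feat, depth_feat])
```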
Industrial operators need reliable communication in high-noise, safety-critical environments where speech or touch input is often impractical. Existing gesture systems either miss real-time deadlines on resource-constrained hardware or lose accuracy under occlusion, vibration, and lighting changes. We introduce Industrial EdgeSign, a dual-path framework that combines hardware-aware neural architecture search (NAS) with large multimodal model (LMM) guided semantics to deliver robust, low-latency gesture recognition on edge devices. The searched model uses a truncated ResNet50 front end, a dimensional-reduction network that preserves spatiotemporal structure for tubelet-based attention, and localized Transformer layers tuned for on-device inference. To reduce reliance on gloss annotations and mitigate domain shift, we distill semantics from factory-tuned vision-language models and pre-train with masked language modeling and video-text contrastive objectives, aligning visual features with a shared text space. On ML2HP and SHREC'17, the NAS-derived architecture attains 94.7% accuracy with 86 ms inference latency and about 5.9 W power on a Jetson Nano. Under occlusion, lighting shifts, and motion blur, accuracy remains above 82%. For safety-critical commands, the emergency-stop gesture achieves 72 ms 99th-percentile latency with 99.7% fail-safe triggering. Ablation studies confirm the contribution of the spatiotemporal tubelet extractor and text-side pre-training, and we observe gains in translation quality (BLEU-4 22.33). These results show that Industrial EdgeSign provides accurate, resource-aware, and safety-aligned gesture recognition suitable for deployment in smart factory settings.
Gait recognition is a key biometric for long-distance identification, yet its performance is severely degraded by real-world challenges such as varying clothing, carrying conditions, and changing viewpoints. While combining silhouette and skeleton data is a promising direction, effectively fusing these heterogeneous modalities and adaptively weighting their contributions in response to diverse conditions remains a central problem. This paper introduces GaitMAFF, a novel Multi-modal Adaptive Feature Fusion Network, to address this challenge. Our approach first transforms discrete skeleton joints into a dense SkeletonMap representation to align with silhouettes, then employs an attention-based module to dynamically learn the fusion weights between the two modalities. These fused features are processed by a powerful spatio-temporal backbone with Weighted Global-Local Feature Fusion Modules (WFFM) to learn a discriminative representation. Extensive experiments on the challenging CCPG and Gait3D datasets show that GaitMAFF achieves state-of-the-art performance, with an average Rank-1 accuracy of 84.6% on CCPG and 58.7% on Gait3D. These results demonstrate that our adaptive fusion strategy effectively integrates complementary multimodal information, significantly enhancing gait recognition robustness and accuracy in complex scenes and providing a practical solution for real-world applications.
The fasteners employed in railway tracks are susceptible to defects arising from their intricate composition, and foreign objects are frequently observed on the track bed in open environments. Both types of defect pose potential threats to high-speed trains, necessitating timely and accurate track inspection. Most existing automatic inspection methods rely on single visible-light data, and their performance is affected by complex environments. Furthermore, because of the single information dimension, detection accuracy is low for similar, occluded, and small object categories. To address these issues, this paper proposes a track defect detection method based on dynamic multi-modal fusion and enhanced perception of challenging objects. First, in light of the variances in the representation dimensions of multimodal information, a dynamic weighted multi-modal feature fusion module is proposed. The fused multi-modal features are assigned weights and then multiplied with the extracted single-modal features at multiple levels, achieving adaptive adjustment of the response degree of the fusion features. Second, a novel stepwise multi-scale convolution feature aggregation module is proposed for challenging objects. It employs depthwise separable convolution and cross-scale aggregation over different receptive fields to enhance feature extraction and reuse, thereby reducing the progressive loss of effective information. Extensive experiments on the constructed RGBD dataset demonstrate the efficacy of the proposed method against eight established methods, encompassing both single-modal and multi-modal approaches.
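A single-level sketch of the dynamic weighted fusion idea is given below, assuming equal-shaped RGB and depth feature maps and a squeeze-style weighting head; applying it at multiple backbone levels follows the same pattern. The head design is an illustrative assumption, not the paper's module.

```python
# A minimal sketch of dynamic weighted RGB-D fusion at one level; shapes
# and the weighting head are illustrative.
import torch
import torch.nn as nn

class DynamicWeightedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict per-channel weights from the concatenated modalities.
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb, depth):             # both (B, C, H, W)
        fused = torch.cat([rgb, depth], dim=1)
        w = self.weight_head(fused)            # (B, C, 1, 1)
        # Learned weights modulate each single-modal stream adaptively.
        return w * rgb + (1.0 - w) * depth
```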
Autism spectrum disorder (ASD) is a highly heterogeneous neurodevelopmental disorder. Early diagnosis and intervention are crucial for improving outcomes. Traditional single-modality diagnostic methods are subjective, limited, and struggle to reveal the underlying pathological mechanisms. In contrast, multimodal data analysis integrates behavioral, physiological, and neuroimaging information with advanced machine-learning and deep-learning algorithms to overcome these limitations. In this review, we surveyed the recent pediatric ASD literature, highlighting artificial intelligence-driven diagnostic techniques, multimodal data fusion strategies, and emerging trends in ASD assessment. We surveyed studies that integrated two or more modalities and summarized the fusion levels, learning paradigms, tasks, datasets, and metrics. Multimodal approaches outperform single-modality baselines in classification, severity estimation, and subtyping by leveraging complementary information and reducing modality-specific biases, significantly enhancing diagnostic accuracy and comprehensiveness and enabling early screening of ASD, symptom subtyping, severity assessment, and personalized interventions. Advances in multimodal fusion techniques have promoted progress in precision medicine for the treatment of ASD.
In multi-modal emotion recognition, excessive reliance on historical context often impedes the detection of emotional shifts, while modality heterogeneity and unimodal noise limit recognition performance. Existing methods struggle to dynamically adjust cross-modal complementary strength to optimize fusion quality and lack effective mechanisms to model the dynamic evolution of emotions. To address these issues, we propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition. A dynamic gating mechanism is applied across unimodal encoding, cross-modal alignment, and emotion transfer modeling, substantially improving noise robustness and feature alignment. First, we construct a unimodal encoder based on gated recurrent units and feature-selection gating to suppress intra-modal noise and enhance contextual representation. Second, we design a gated-attention cross-modal encoder that dynamically calibrates the complementary contributions of the visual and audio modalities to the dominant textual features and eliminates redundant information. Finally, we introduce a gated enhanced emotion transfer module that explicitly models the temporal dependence of emotional evolution in dialogues via transfer gating and optimizes continuity modeling with a contrastive learning loss. Experimental results demonstrate that the proposed method outperforms state-of-the-art models on the public MELD and IEMOCAP datasets.
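The gating ideas can be sketched as follows, assuming a sigmoid feature-selection gate after a GRU encoder and a gated cross-attention that calibrates audio/visual contributions to the dominant text features; the exact gate forms in the paper may differ.

```python
# A minimal sketch of gated unimodal encoding and gated cross-modal
# fusion; dimensions and the sigmoid gate form are illustrative.
import torch
import torch.nn as nn

class GatedUnimodalEncoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.Sigmoid())

    def forward(self, x):                       # (B, T, in_dim)
        h, _ = self.gru(x)                      # (B, T, hid_dim)
        return h * self.gate(h)                 # feature-selection gating

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text, aux):               # both (B, T, dim)
        comp, _ = self.attn(text, aux, aux)     # complementary features
        g = self.gate(torch.cat([text, comp], dim=-1))
        return text + g * comp                  # calibrated contribution
```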
To address the challenge of achieving decentralized, scalable, and adaptive control for large-scale multiple unmanned aerial vehicle (multi-UAV) swarms in dynamic urban environments with obstacles and wind perturbations, we propose a hybrid framework integrating adaptive reinforcement learning (RL), multi-modal perception fusion, and enhanced pigeon flock optimization (PFO) with curiosity-driven exploration to enable robust autonomous formation control. The framework leverages meta-learning to optimize RL policies for real-time adaptation, fuses sensor data for precise state estimation, and enhances PFO with learned leader-follower dynamics and exploration rewards to maintain cohesive formations and explore uncertain areas. For swarms of 10–30 UAVs, it achieves 34% faster convergence, 61% lower stability root mean square error (RMSE), 88% fewer collisions, and 85.6%–92.3% success rates in target detection and encirclement, outperforming standard multi-agent RL, pure PFO, and single-modality RL. Three-dimensional trajectory visualizations confirm cohesive formations, collision-free maneuvers, and efficient exploration in urban search-and-rescue scenarios. Innovations include meta-RL for rapid adaptation, multi-modal fusion for robust perception, and curiosity-driven PFO for scalable, decentralized control, advancing real-world multi-UAV swarm autonomy and coordination.
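The curiosity-driven exploration reward can be illustrated with a forward-model prediction-error bonus (an ICM-style design); this is a hedged sketch under that assumption, since the abstract does not specify the exact reward formulation.

```python
# A minimal sketch of an intrinsic curiosity bonus via forward-model
# prediction error; networks, scale, and dims are illustrative.
import torch
import torch.nn as nn

class CuriosityModule(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, scale: float = 0.1):
        super().__init__()
        self.scale = scale
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def intrinsic_reward(self, state, action, next_state):
        pred = self.forward_model(torch.cat([state, action], dim=-1))
        # Poorly predicted (novel) transitions earn a larger bonus,
        # pushing UAVs to explore uncertain areas.
        return self.scale * (pred - next_state).pow(2).mean(dim=-1)
```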
In gesture recognition, static gestures, dynamic gestures, and trajectory gestures are collectively known as multi-modal gestures. To overcome the need for different recognition methods for different modal gestures, a unified recognition algorithm is proposed. The angle-change data of the finger joints and the movement of the centroid of the hand were acquired by a data glove and Kinect, respectively. Through preprocessing of the multi-source heterogeneous data, all hand gestures were treated as curves while hand shaking was resolved, and a uniform hand gesture recognition algorithm was established that computes the Pearson correlation coefficient between hand gestures for recognition. In this way, complex gesture recognition was transformed into a simple comparison of curve similarities. The main innovations are: 1) a unified recognition model and a new algorithm for multi-modal gesture recognition; 2) the first use of the Pearson correlation coefficient to construct a gesture similarity operator. Testing on 50 kinds of gestures showed that the presented method could cope with intricate gesture interaction with a 97.7% recognition rate.
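The curve-comparison step can be illustrated directly: resample two gesture curves to a common length and score them with the Pearson correlation coefficient. The resampling and per-dimension averaging below are illustrative implementation choices, not the paper's exact operator.

```python
# A minimal sketch of Pearson-correlation gesture matching over curves.
import numpy as np

def resample(curve, n=100):
    """Linearly resample a (T, D) gesture curve to n points."""
    t_old = np.linspace(0.0, 1.0, len(curve))
    t_new = np.linspace(0.0, 1.0, n)
    return np.stack([np.interp(t_new, t_old, curve[:, d])
                     for d in range(curve.shape[1])], axis=1)

def gesture_similarity(a, b):
    """Mean Pearson r across dimensions of two resampled curves."""
    a, b = resample(a), resample(b)
    rs = [np.corrcoef(a[:, d], b[:, d])[0, 1] for d in range(a.shape[1])]
    return float(np.mean(rs))

def recognize(query, templates):
    """Return the label whose template curve is most correlated with query."""
    return max(templates, key=lambda lbl: gesture_similarity(query, templates[lbl]))
```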
Based on teaching videos of middle school English teachers, through observation and analysis, this paper identifies the problems of under-use, misuse, and overuse of teaching gestures in middle school English teaching. It then proposes corresponding solutions from three aspects: concept, theory, and practice, hoping to provide further reference on the complementary role of teaching gesture and teaching discourse.
Objective: To develop a non-invasive predictive model for coronary artery stenosis severity based on adaptive multi-modal integration of traditional Chinese and western medicine data. Methods: Clinical indicators, echocardiographic data, traditional Chinese medicine (TCM) tongue manifestations, and facial features were collected from patients who underwent coronary computed tomography angiography (CTA) in the Cardiac Care Unit (CCU) of Shanghai Tenth People's Hospital between May 1, 2023 and May 1, 2024. An adaptive weighted multi-modal data fusion (AWMDF) model based on deep learning was constructed to predict the severity of coronary artery stenosis. The model was evaluated using metrics including accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic (ROC) curve (AUC). Further performance assessment was conducted through comparisons with six ensemble machine learning methods, data ablation, model component ablation, and various decision-level fusion strategies. Results: A total of 158 patients were included in the study. The AWMDF model achieved excellent predictive performance (AUC = 0.973, accuracy = 0.937, precision = 0.937, recall = 0.929, and F1 score = 0.933). Compared with model ablation, data ablation experiments, and various traditional machine learning models, the AWMDF model demonstrated superior performance. Moreover, the adaptive weighting strategy outperformed alternative approaches, including simple weighting, averaging, voting, and fixed-weight schemes. Conclusion: The AWMDF model demonstrates potential clinical value in the non-invasive prediction of coronary artery disease and could serve as a tool for clinical decision support.
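For reference, the reported metrics can be computed with scikit-learn as below; y_true, y_pred, and y_score stand in for held-out labels, predictions, and class probabilities, and the macro averaging and one-vs-rest AUC are assumptions for a multi-class severity grading.

```python
# A minimal sketch of the evaluation protocol with scikit-learn metrics;
# averaging choices are illustrative assumptions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        # One-vs-rest AUC over predicted class probabilities (n, n_classes).
        "auc":       roc_auc_score(y_true, y_score, multi_class="ovr"),
    }
```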
Traditional Chinese medicine (TCM) demonstrates distinctive advantages in disease prevention and treatment. However, analyzing its biological mechanisms through the modern medical research paradigm of "single drug, single target" presents significant challenges due to its holistic approach. Network pharmacology and its core theory of network targets connect drugs and diseases from a holistic and systematic perspective based on biological networks, overcoming the limitations of reductionist research models and showing considerable value in TCM research. Recent integration of network target computational and experimental methods with artificial intelligence (AI) and multi-modal multi-omics technologies has substantially enhanced network pharmacology methodology. The advancement of computational and experimental techniques provides complementary support for network target theory in decoding TCM principles. This review, centered on network targets, examines the progress of network target methods combined with AI in predicting disease molecular mechanisms and drug-target relationships, alongside the application of multi-modal multi-omics technologies in analyzing TCM formulae, syndromes, and toxicity. Looking forward, network target theory is expected to incorporate emerging technologies while developing novel approaches aligned with its unique characteristics, potentially leading to significant breakthroughs in TCM research and advancing scientific understanding and innovation in TCM.
The multi-modal characteristics of mineral particles play a pivotal role in enhancing classification accuracy, which is critical for obtaining a profound understanding of the Earth's composition and ensuring effective exploitation and utilization of its resources. However, existing methods for classifying mineral particles do not fully utilize these multi-modal features, thereby limiting classification accuracy. Furthermore, when conventional multi-modal image classification methods are applied to plane-polarized and cross-polarized sequence images of mineral particles, they encounter issues such as information loss, misaligned features, and challenges in spatiotemporal feature extraction. To address these challenges, we propose a multi-modal mineral particle polarization image classification network (MMGC-Net) for precise mineral particle classification. Initially, MMGC-Net employs a two-dimensional (2D) backbone network with shared parameters to extract features from the two types of polarized images, ensuring feature alignment. Subsequently, a cross-polarized intra-modal feature fusion module is designed to refine spatiotemporal features from the extracted features of the cross-polarized sequence images. Ultimately, an inter-modal feature fusion module integrates the two types of modal features to enhance classification precision. Quantitative and qualitative experimental results indicate that, compared with current state-of-the-art multi-modal image classification methods, MMGC-Net demonstrates marked superiority in mineral particle multi-modal feature learning and on four classification evaluation metrics. It also demonstrates better stability than existing models.
With the advent of the next-generation Air Traffic Control (ATC) system, there is growing interest in using Artificial Intelligence (AI) techniques to enhance Situation Awareness (SA) for ATC Controllers (ATCOs), i.e., Intelligent SA (ISA). However, existing AI-based SA approaches often rely on unimodal data and lack a comprehensive description and benchmark of ISA tasks utilizing multi-modal data for real-time ATC environments. To address this gap, by analyzing the situation awareness procedure of ATCOs, the ISA task is refined to the processing of two primary elements, i.e., spoken instructions and flight trajectories. Subsequently, ISA is further formulated into Controlling Intent Understanding (CIU) and Flight Trajectory Prediction (FTP) tasks. For the CIU task, an innovative automatic speech recognition and understanding framework is designed to extract the controlling intent from unstructured and continuous ATC communications. For the FTP task, single- and multi-horizon FTP approaches are investigated to support high-precision prediction of the situation evolution. A total of 32 unimodal/multi-modal advanced methods with extensive evaluation metrics are introduced to conduct benchmarks on a real-world multi-modal ATC situation dataset. Experimental results demonstrate the effectiveness of AI-based techniques in enhancing ISA for the ATC environment.
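As a sketch of what a multi-horizon FTP baseline might look like, the following recurrent model encodes the observed trajectory and iteratively decodes future points; the LSTM design, feature dimensions, and 15-step horizon are assumptions for illustration, not the benchmark's actual methods.

```python
# A minimal sketch of multi-horizon trajectory prediction: encode the past
# track, then feed predictions back in autoregressively. Dims illustrative
# (e.g., longitude, latitude, altitude per step).
import torch
import torch.nn as nn

class MultiHorizonFTP(nn.Module):
    def __init__(self, feat_dim=3, hidden=64, horizon=15):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, past):                   # (B, T, feat_dim)
        out, (h, c) = self.encoder(past)
        step = past[:, -1:, :]                 # last observed point
        preds = []
        for _ in range(self.horizon):          # autoregressive decoding
            out, (h, c) = self.encoder(step, (h, c))
            step = self.head(out)
            preds.append(step)
        return torch.cat(preds, dim=1)         # (B, horizon, feat_dim)
```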
Personalized outfit recommendation has emerged as a hot research topic in the fashion domain. However, existing recommendation methods do not fully exploit user style preferences. Typically, users prefer particular styles such as casual and athletic styles, and consider attributes like color and texture when selecting outfits. To achieve personalized outfit recommendations in line with user style preferences, this paper proposes a personal style guided outfit recommendation with multi-modal fashion compatibility modeling, termed PSGNet. Firstly, a style classifier is designed to categorize fashion images of various clothing types and attributes into distinct style categories. Secondly, a personal style prediction module extracts user style preferences by analyzing historical data. Then, to address the limitations of single-modal representations and enhance fashion compatibility, both fashion images and text data are leveraged to extract multi-modal features. Finally, PSGNet integrates these components through Bayesian personalized ranking (BPR) to unify personal style and fashion compatibility, where the former is used as personal style features that guide the output of the personalized outfit recommendation tailored to the target user. Extensive experiments on large-scale datasets demonstrate that the proposed model is effective for personalized outfit recommendation.
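The BPR objective used to unify personal style and compatibility is standard and can be sketched briefly; the score() calls in the usage comment are hypothetical stand-ins for PSGNet's actual scoring function.

```python
# A minimal sketch of the Bayesian personalized ranking (BPR) loss:
# push each observed (positive) outfit above a sampled negative one.
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """BPR: maximize log-sigmoid of the positive/negative score margin."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Usage sketch (model.score is a hypothetical scorer mixing style and
# compatibility terms):
# loss = bpr_loss(model.score(user, pos_outfit), model.score(user, neg_outfit))
```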
This article describes a pilot study aiming at generating social interactions between a humanoid robot and adolescents with autism spectrum disorder (ASD) through the practice of a gesture imitation game. The participants were a 17-year-old young lady with ASD and intellectual deficit, and a control participant: a preadolescent with ASD but no intellectual deficit (Asperger syndrome). The game comprises four phases: greetings, pairing, imitation, and closing. Field educators were involved, playing specific roles: visual or physical inciter. The use of a robot allows for catching the participants' attention, playing the imitation game for a longer period of time than with a human partner, and preventing the game partner's negative facial expressions resulting from tiredness, impatience, or boredom. The participants' behavior was observed in terms of initial approach towards the robot, positioning relative to the robot in terms of distance and orientation, reactions to the robot's voice or moves, signs of happiness, and imitation attempts. Results suggest an increasingly natural approach towards the robot over the sessions, as well as a higher level of social interaction, based on the variations of the parameters listed above. We use these preliminary results to outline the next steps of our research work and identify further perspectives, with this aim in mind: improving social interactions with adolescents with ASD and intellectual deficit, allowing for better integration of these people into our societies.
The rapid evolution of virtual reality (VR) and augmented reality (AR) technologies has significantly transformed human-computer interaction, with applications spanning entertainment, education, healthcare, industry, and remote collaboration. A central challenge in these immersive systems lies in enabling intuitive, efficient, and natural interactions. Hand gesture recognition offers a compelling solution by leveraging the expressiveness of human hands to facilitate seamless control without relying on traditional input devices such as controllers or keyboards, which can limit immersion. However, achieving robust gesture recognition requires overcoming challenges related to accurate hand tracking, complex environmental conditions, and minimizing system latency. This study proposes an artificial intelligence (AI)-driven framework for recognizing both static and dynamic hand gestures in VR and AR environments using skeleton-based tracking compliant with the OpenXR standard. Our approach employs a lightweight neural network architecture capable of real-time classification within approximately 1.3 ms while maintaining an average accuracy of 95%. We also introduce a novel dataset generation method to support training robust models and demonstrate consistent classification of diverse gestures across widespread commercial VR devices. This work represents one of the first studies to implement and validate dynamic hand gesture recognition in real time using standardized VR hardware, laying the groundwork for more immersive, accessible, and user-friendly interaction systems. By advancing AI-driven gesture interfaces, this research has the potential to broaden the adoption of VR and AR across diverse domains and enhance the overall user experience.
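A lightweight static-gesture classifier over OpenXR-style hand skeletons can be as simple as the sketch below; the 26-joint input reflects the OpenXR hand-tracking extension, but the layer sizes and 12-gesture output are illustrative assumptions, not the paper's architecture.

```python
# A minimal sketch of a lightweight static-gesture classifier over
# OpenXR-style hand skeletons; layer sizes and class count illustrative.
import torch
import torch.nn as nn

N_JOINTS, N_GESTURES = 26, 12       # OpenXR hand tracking exposes 26 joints

class StaticGestureNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, 26, 3) -> (B, 78)
            nn.Linear(N_JOINTS * 3, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, N_GESTURES),
        )

    def forward(self, joints):                   # (B, 26, 3) wrist-relative xyz
        return self.net(joints)
```

Normalizing the joints relative to the wrist keeps the classifier translation-invariant, which helps a latency-critical pipeline stay this simple.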
Multi-modal Named Entity Recognition (MNER) aims to better identify meaningful textual entities by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level, or obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches ignore the poor semantic correlation between visual and textual modalities in MNER datasets and do not explore different multi-modal fusion approaches. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which aims to leverage the Multi-modal Large Language Model (MLLM) as an implicit knowledge base. It also extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which guides the MLLM to extract external knowledge exclusively derived from images to aid entity recognition by designing target-specific prompts, thus avoiding redundant recognition and cognitive confusion caused by the simultaneous processing of image-text pairs. Furthermore, we employ a word-level multi-modal fusion mechanism to fuse the extracted external knowledge with each word embedding from the transformer-based encoder. Extensive experimental results demonstrate that MMAVK outperforms or equals the state-of-the-art methods on two classical MNER datasets, even when the large models employed have significantly fewer parameters than other baselines.
Multi-modal knowledge graph completion (MMKGC) aims to complete missing entities or relations in multi-modal knowledge graphs, thereby discovering more previously unknown triples. Due to the continuous growth of data and knowledge and the limitations of data sources, the visual knowledge within knowledge graphs is generally of low quality, and some entities suffer from missing visual modality. Nevertheless, previous studies of MMKGC have primarily focused on facilitating modality interaction and fusion while neglecting the problems of low modality quality and missing modalities. In this case, mainstream MMKGC models only use pre-trained visual encoders to extract features and transfer the semantic information to the joint embeddings through modal fusion, which inevitably suffers from problems such as error propagation and increased uncertainty. To address these problems, we propose a Multi-modal knowledge graph Completion model based on Super-resolution and Detailed Description Generation (MMCSD). Specifically, we leverage a pre-trained residual network to enhance the resolution and improve the quality of the visual modality. Moreover, we design multi-level visual semantic extraction and entity description generation, thereby further extracting entity semantics from structural triples and visual images. Meanwhile, we train a variational multi-modal auto-encoder and utilize a pre-trained multi-modal language model to complement the missing visual features. We conducted experiments on FB15K-237 and DB13K, and the results showed that MMCSD can effectively perform MMKGC and achieve state-of-the-art performance.
Integrating multiple medical imaging techniques, including Magnetic Resonance Imaging (MRI), Computed Tomography, Positron Emission Tomography (PET), and ultrasound, provides a comprehensive view of a patient's health status. Each of these methods contributes unique diagnostic insights, enhancing the overall assessment of the patient's condition. Nevertheless, the amalgamation of data from multiple modalities presents difficulties due to disparities in resolution, data collection methods, and noise levels. While traditional models like Convolutional Neural Networks (CNNs) excel in single-modality tasks, they struggle to handle multi-modal complexities, lacking the capacity to model global relationships. This research presents a novel approach for examining multi-modal medical imagery using a transformer-based system. The framework employs self-attention and cross-attention mechanisms to synchronize and integrate features across various modalities. Additionally, it shows resilience to variations in noise and image quality, making it adaptable for real-time clinical use. To address the computational hurdles linked to transformer models, particularly in real-time clinical applications in resource-constrained environments, several optimization techniques have been integrated to boost scalability and efficiency. Initially, a streamlined transformer architecture was adopted to minimize the computational load while maintaining model effectiveness. Methods such as model pruning, quantization, and knowledge distillation have been applied to reduce the parameter count and enhance inference speed. Furthermore, efficient attention mechanisms such as linear or sparse attention were employed to alleviate the substantial memory and processing requirements of traditional self-attention operations. For further deployment optimization, hardware-aware acceleration strategies, including TensorRT and ONNX-based model compression, were implemented to ensure efficient execution on edge devices. These optimizations allow the approach to function effectively in real-time clinical settings, ensuring viability even in environments with limited resources. Future research directions include integrating non-imaging data to facilitate personalized treatment and enhancing computational efficiency for implementation in resource-limited environments. This study highlights the transformative potential of transformer models in multi-modal medical imaging, offering improvements in diagnostic accuracy and patient care outcomes.
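The self-/cross-attention synchronization can be sketched with standard PyTorch primitives; the token-sequence interface and dimensions are assumptions, and the block omits the feed-forward sublayers of a full transformer stack.

```python
# A minimal sketch of a cross-attention block aligning two imaging
# modalities (e.g., MRI and PET token sequences); dims illustrative.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, other):                 # x, other: (B, N_tokens, dim)
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        # Queries from one modality attend to keys/values of the other,
        # synchronizing features before fusion.
        return self.norm2(x + self.cross_attn(x, other, other)[0])
```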
基金Supported by Grant-in-Aid for Young Scientists(A)(Grant No.26700021)Japan Society for the Promotion of Science and Strategic Information and Communications R&D Promotion Programme(Grant No.142103011)Ministry of Internal Affairs and Communications
文摘Gesture recognition is used in many practical applications such as human-robot interaction, medical rehabilitation and sign language. With increasing motion sensor development, multiple data sources have become available, which leads to the rise of multi-modal gesture recognition. Since our previous approach to gesture recognition depends on a unimodal system, it is difficult to classify similar motion patterns. In order to solve this problem, a novel approach which integrates motion, audio and video models is proposed by using dataset captured by Kinect. The proposed system can recognize observed gestures by using three models. Recognition results of three models are integrated by using the proposed framework and the output becomes the final result. The motion and audio models are learned by using Hidden Markov Model. Random Forest which is the video classifier is used to learn the video model. In the experiments to test the performances of the proposed system, the motion and audio models most suitable for gesture recognition are chosen by varying feature vectors and learning methods. Additionally, the unimodal and multi-modal models are compared with respect to recognition accuracy. All the experiments are conducted on dataset provided by the competition organizer of MMGRC, which is a workshop for Multi-Modal Gesture Recognition Challenge. The comparison results show that the multi-modal model composed of three models scores the highest recognition rate. This improvement of recognition accuracy means that the complementary relationship among three models improves the accuracy of gesture recognition. The proposed system provides the application technology to understand human actions of daily life more precisely.
基金the Chinese National Natural Science Foundation Projects(61961160704,61876179)the Key Project of the General Logistics Department(ASW17C001)the Science and Technology Development Fund of Macao(0010/2019/AFJ,0025/2019/AKP).
文摘Background Gesture recognition has attracted significant attention because of its wide range of potential applications.Although multi-modal gesture recognition has made significant progress in recent years,a popular method still is simply fusing prediction scores at the end of each branch,which often ignores complementary features among different modalities in the early stage and does not fuse the complementary features into a more discriminative feature.Methods This paper proposes an Adaptive Cross-modal Weighting(ACmW)scheme to exploit complementarity features from RGB-D data in this study.The scheme learns relations among different modalities by combining the features of different data streams.The proposed ACmW module contains two key functions:(1)fusing complementary features from multiple streams through an adaptive one-dimensional convolution;and(2)modeling the correlation of multi-stream complementary features in the time dimension.Through the effective combination of these two functional modules,the proposed ACmW can automatically analyze the relationship between the complementary features from different streams,and can fuse them in the spatial and temporal dimensions.Results Extensive experiments validate the effectiveness of the proposed method,and show that our method outperforms state-of-the-art methods on IsoGD and NVGesture.
文摘Industrial operators need reliable communication in high-noise,safety-critical environments where speech or touch input is often impractical.Existing gesture systems either miss real-time deadlines on resourceconstrained hardware or lose accuracy under occlusion,vibration,and lighting changes.We introduce Industrial EdgeSign,a dual-path framework that combines hardware-aware neural architecture search(NAS)with large multimodalmodel(LMM)guided semantics to deliver robust,low-latency gesture recognition on edge devices.The searched model uses a truncated ResNet50 front end,a dimensional-reduction network that preserves spatiotemporal structure for tubelet-based attention,and localized Transformer layers tuned for on-device inference.To reduce reliance on gloss annotations and mitigate domain shift,we distill semantics from factory-tuned vision-language models and pre-train with masked language modeling and video-text contrastive objectives,aligning visual features with a shared text space.OnML2HP and SHREC’17,theNAS-derived architecture attains 94.7% accuracywith 86ms inference latency and about 5.9W power on Jetson Nano.Under occlusion,lighting shifts,andmotion blur,accuracy remains above 82%.For safetycritical commands,the emergency-stop gesture achieves 72 ms 99th percentile latency with 99.7% fail-safe triggering.Ablation studies confirm the contribution of the spatiotemporal tubelet extractor and text-side pre-training,and we observe gains in translation quality(BLEU-422.33).These results show that Industrial EdgeSign provides accurate,resource-aware,and safety-aligned gesture recognition suitable for deployment in smart factory settings.
基金funded by the Natural Science Foundation of Chongqing Municipality,grant number CSTB2022NSCQ-MSX0503.
文摘Gait recognition is a key biometric for long-distance identification,yet its performance is severely degraded by real-world challenges such as varying clothing,carrying conditions,and changing viewpoints.While combining silhouette and skeleton data is a promising direction,effectively fusing these heterogeneous modalities and adaptively weighting their contributions in response to diverse conditions remains a central problem.This paper introduces GaitMAFF,a novelMulti-modal Adaptive Feature Fusion Network,to address this challenge.Our approach first transforms discrete skeleton joints into a dense SkeletonMap representation to align with silhouettes,then employs an attention-based module to dynamically learn the fusion weights between the two modalities.These fused features are processed by a powerful spatio-temporal backbone withWeighted Global-Local Feature FusionModules(WFFM)to learn a discriminative representation.Extensive experiments on the challenging CCPG and Gait3D datasets show that GaitMAFF achieves state-of-the-art performance,with an average Rank-1 accuracy of 84.6%on CCPG and 58.7%on Gait3D.These results demonstrate that our adaptive fusion strategy effectively integrates complementary multimodal information,significantly enhancing gait recognition robustness and accuracy in complex scenes and providing a practical solution for real-world applications.
基金funded by Beijing Natural Science Foundation,grant number L241078.
文摘The fasteners employed in the railway tracks are susceptible to defects arising from their intricate composition.Foreign objects are frequently observed on the track bed in an open environment.These two types of defects pose potential threats to high-speed trains,thus necessitating timely and accurate track inspection.The majority of extant automatic inspection methods are predicated on the utilization of single visible light data,and the efficacy of the algorithmic processes is influenced by complex environments.Furthermore,due to the single information dimension,the detection accuracy of defects in similar,occluded,and small object categories is low.To address the aforementioned issues,this paper proposes a track defect detectionmethod based on dynamicmulti-modal fusion and challenging object enhanced perception.First,in light of the variances in the representation dimensions ofmultimodal information,this paper proposes a dynamic weighted multi-modal feature fusion module.The fused multi-modal features are assigned weights,and thenmultiplied with the extracted single-modal features atmultiple levels,achieving adaptive adjustment of the response degree of fusion features.Second,a novel stepwise multi-scale convolution feature aggregation module is proposed for challenging objects.The proposed method employs depth separable convolution and cross-scale aggregation operations of different receptive fields to enhance feature extraction and reuse,thereby reducing the degree of progressive loss of effective information.The experimental results demonstrate the efficacy of the proposed method in comparison to eight established methods,encompassing both single-modal and multi-modal methods,as evidenced by the extensive findings within the constructed RGBD dataset.
基金supported by the National Key Research and Development Program of China(Research Grant Number:2023YFC3603600).
文摘Autism spectrum disorder(AsD)is a highly heterogeneous neurodevelopmental disorder.Early diagnosis and intervention are crucial for improving outcomes.Traditional single-modality diagnostic methods are subjective,limited,and struggle to reveal the underlying pathological mechanisms.In contrast,multimodal data analysis integrates behavioral,physiological,and neuroimaging information with advanced machine-learning and deeplearning algorithms to overcome these limitations.In this review,we surveyed the recent pediatric AsD literature,highlighting artificial intelligence-driven diagnostic techniques,multimodal data fusion strategies,and emerging trends in ASD assessment.We surveyed studies that integrated two or more modalities and summarized the fusion levels,learning paradigms,tasks,datasets,and metrics.Multimodal approaches outperform singlemodality baselines in classification,severity estimation,and subtyping by leveraging complementary information and reducing modality-specific biases.Multimodal approaches significantly enhance diagnostic accuracy and comprehensiveness,enabling early screening of AsD,symptom subtyping,severity assessment,and personalized interventions.Advances in multimodal fusion techniques have promoted progress in precision medicine for the treatment of ASD.
基金funded by“the Fanying Special Program of the National Natural Science Foundation of China,grant number 62341307”“the Scientific research project of Jiangxi Provincial Department of Education,grant number GJJ200839”“theDoctoral startup fund of JiangxiUniversity of Technology,grant number 205200100402”.
文摘In multi-modal emotion recognition,excessive reliance on historical context often impedes the detection of emotional shifts,while modality heterogeneity and unimodal noise limit recognition performance.Existing methods struggle to dynamically adjust cross-modal complementary strength to optimize fusion quality and lack effective mechanisms to model the dynamic evolution of emotions.To address these issues,we propose a multi-level dynamic gating and emotion transfer framework for multi-modal emotion recognition.A dynamic gating mechanism is applied across unimodal encoding,cross-modal alignment,and emotion transfer modeling,substantially improving noise robustness and feature alignment.First,we construct a unimodal encoder based on gated recurrent units and feature-selection gating to suppress intra-modal noise and enhance contextual representation.Second,we design a gated-attention crossmodal encoder that dynamically calibrates the complementary contributions of visual and audio modalities to the dominant textual features and eliminates redundant information.Finally,we introduce a gated enhanced emotion transfer module that explicitly models the temporal dependence of emotional evolution in dialogues via transfer gating and optimizes continuity modeling with a comparative learning loss.Experimental results demonstrate that the proposed method outperforms state-of-the-art models on the public MELD and IEMOCAP datasets.
基金supported by the National Natural Science Foundation of China(No.62350048)。
文摘To address the challenge of achieving decentralized,scalable,and adaptive control for large-scale multiple unmanned aerial vehicle(multi-UAV)swarms in dynamic urban environments with obstacles and wind perturbations,we proposed a hybrid framework integrating adaptive reinforcement learning(RL),multi-modal perception fusion,and enhanced pigeon flock optimization(PFO)with curiosity-driven exploration to enable robust autonomous and formation control.The framework leverages meta-learning to optimize RL policies for real-time adaptation,fuses sensor data for precise state estimation,and enhances PFO with learned leader-follower dynamics and exploration rewards to maintain cohesive formations and explore uncertain areas.For swarms of 10–30 UAVs,it achieves 34%faster convergence,61%reduced stability root mean square error(RMSE),88%fewer collisions and 85.6%–92.3%success rates in target detection and encirclement,outperforming standard multi-agent RL,pure PFO,and single-modality RL.Three-dimensional trajectory visualizations confirm cohesive formations,collision-free maneuvers,and efficient exploration in urban search-and-rescue scenarios.Innovations include meta-RL for rapid adaptation,multi-modal fusion for robust perception,and curiosity-driven PFO for scalable,decentralized control,advancing real-world multi-UAV swarm autonomy and coordination.
基金supported by the National Key R&D Program of China(2018YFB1004901)the National Natural Science Foundation of China(61472163,61472232)
文摘In gesture recognition,static gestures,dynamic gestures and trajectory gestures are collectively known as multi-modal gestures.To solve the existing problem in different recognition methods for different modal gestures,a unified recognition algorithm is proposed.The angle change data of the finger joints and the movement of the centroid of the hand were acquired respectively by data glove and Kinect.Through the preprocessing of the multi-source heterogeneous data,all hand gestures were considered as curves while solving hand shaking,and a uniform hand gesture recognition algorithm was established to calculate the Pearson correlation coefficient between hand gestures for gesture recognition.In this way,complex gesture recognition was transformed into the problem of a simple comparison of curves similarities.The main innovations:1) Aiming at solving the problem of multi-modal gesture recognition,an unified recognition model and a new algorithm is proposed;2) The Pearson correlation coefficient for the first time to construct the gesture similarity operator is improved.By testing 50 kinds of gestures,the experimental results showed that the method presented could cope with intricate gesture interaction with the 97.7% recognition rate.
文摘Based on the teaching video of middle school English teachers, through observation and analysis, it puts forward the problem of less use, wrong use and abuse in the use of teachers' teaching gestures in middle school English teaching. And then it puts forward corresponding solutions from three aspects: concept, theory and practice. Hoping to provide further reference to the complementary role of teaching gesture and teaching discourse.
基金Construction Program of the Key Discipline of State Administration of Traditional Chinese Medicine of China(ZYYZDXK-2023069)Research Project of Shanghai Municipal Health Commission (2024QN018)Shanghai University of Traditional Chinese Medicine Science and Technology Development Program (23KFL005)。
文摘Objective To develop a non-invasive predictive model for coronary artery stenosis severity based on adaptive multi-modal integration of traditional Chinese and western medicine data.Methods Clinical indicators,echocardiographic data,traditional Chinese medicine(TCM)tongue manifestations,and facial features were collected from patients who underwent coro-nary computed tomography angiography(CTA)in the Cardiac Care Unit(CCU)of Shanghai Tenth People's Hospital between May 1,2023 and May 1,2024.An adaptive weighted multi-modal data fusion(AWMDF)model based on deep learning was constructed to predict the severity of coronary artery stenosis.The model was evaluated using metrics including accura-cy,precision,recall,F1 score,and the area under the receiver operating characteristic(ROC)curve(AUC).Further performance assessment was conducted through comparisons with six ensemble machine learning methods,data ablation,model component ablation,and various decision-level fusion strategies.Results A total of 158 patients were included in the study.The AWMDF model achieved ex-cellent predictive performance(AUC=0.973,accuracy=0.937,precision=0.937,recall=0.929,and F1 score=0.933).Compared with model ablation,data ablation experiments,and various traditional machine learning models,the AWMDF model demonstrated superior per-formance.Moreover,the adaptive weighting strategy outperformed alternative approaches,including simple weighting,averaging,voting,and fixed-weight schemes.Conclusion The AWMDF model demonstrates potential clinical value in the non-invasive prediction of coronary artery disease and could serve as a tool for clinical decision support.
文摘Traditional Chinese medicine(TCM)demonstrates distinctive advantages in disease prevention and treatment.However,analyzing its biological mechanisms through the modern medical research paradigm of“single drug,single target”presents significant challenges due to its holistic approach.Network pharmacology and its core theory of network targets connect drugs and diseases from a holistic and systematic perspective based on biological networks,overcoming the limitations of reductionist research models and showing considerable value in TCM research.Recent integration of network target computational and experimental methods with artificial intelligence(AI)and multi-modal multi-omics technologies has substantially enhanced network pharmacology methodology.The advancement in computational and experimental techniques provides complementary support for network target theory in decoding TCM principles.This review,centered on network targets,examines the progress of network target methods combined with AI in predicting disease molecular mechanisms and drug-target relationships,alongside the application of multi-modal multi-omics technologies in analyzing TCM formulae,syndromes,and toxicity.Looking forward,network target theory is expected to incorporate emerging technologies while developing novel approaches aligned with its unique characteristics,potentially leading to significant breakthroughs in TCM research and advancing scientific understanding and innovation in TCM.
基金supported by the National Natural Science Foundation of China(Grant Nos.62071315 and 62271336).
文摘The multi-modal characteristics of mineral particles play a pivotal role in enhancing the classification accuracy,which is critical for obtaining a profound understanding of the Earth's composition and ensuring effective exploitation utilization of its resources.However,the existing methods for classifying mineral particles do not fully utilize these multi-modal features,thereby limiting the classification accuracy.Furthermore,when conventional multi-modal image classification methods are applied to planepolarized and cross-polarized sequence images of mineral particles,they encounter issues such as information loss,misaligned features,and challenges in spatiotemporal feature extraction.To address these challenges,we propose a multi-modal mineral particle polarization image classification network(MMGC-Net)for precise mineral particle classification.Initially,MMGC-Net employs a two-dimensional(2D)backbone network with shared parameters to extract features from two types of polarized images to ensure feature alignment.Subsequently,a cross-polarized intra-modal feature fusion module is designed to refine the spatiotemporal features from the extracted features of the cross-polarized sequence images.Ultimately,the inter-modal feature fusion module integrates the two types of modal features to enhance the classification precision.Quantitative and qualitative experimental results indicate that when compared with the current state-of-the-art multi-modal image classification methods,MMGC-Net demonstrates marked superiority in terms of mineral particle multi-modal feature learning and four classification evaluation metrics.It also demonstrates better stability than the existing models.
基金supported by the National Natural Science Foundation of China(Nos.62371323,62401380,U2433217,U2333209,and U20A20161)Natural Science Foundation of Sichuan Province,China(Nos.2025ZNSFSC1476)+2 种基金Sichuan Science and Technology Program,China(Nos.2024YFG0010 and 2024ZDZX0046)the Institutional Research Fund from Sichuan University(Nos.2024SCUQJTX030)the Open Fund of Key Laboratory of Flight Techniques and Flight Safety,CAAC(Nos.GY2024-01A).
文摘With the advent of the next-generation Air Traffic Control(ATC)system,there is growing interest in using Artificial Intelligence(AI)techniques to enhance Situation Awareness(SA)for ATC Controllers(ATCOs),i.e.,Intelligent SA(ISA).However,the existing AI-based SA approaches often rely on unimodal data and lack a comprehensive description and benchmark of the ISA tasks utilizing multi-modal data for real-time ATC environments.To address this gap,by analyzing the situation awareness procedure of the ATCOs,the ISA task is refined to the processing of the two primary elements,i.e.,spoken instructions and flight trajectories.Subsequently,the ISA is further formulated into Controlling Intent Understanding(CIU)and Flight Trajectory Prediction(FTP)tasks.For the CIU task,an innovative automatic speech recognition and understanding framework is designed to extract the controlling intent from unstructured and continuous ATC communications.For the FTP task,the single-and multi-horizon FTP approaches are investigated to support the high-precision prediction of the situation evolution.A total of 32 unimodal/multi-modal advanced methods with extensive evaluation metrics are introduced to conduct the benchmarks on the real-world multi-modal ATC situation dataset.Experimental results demonstrate the effectiveness of AI-based techniques in enhancing ISA for the ATC environment.
Funding: Supported by the Shanghai Frontier Science Research Center for Modern Textiles, Donghua University, China; the Open Project of the Henan Key Laboratory of Intelligent Manufacturing of Mechanical Equipment, Zhengzhou University of Light Industry, China (No. IM202303); and the National Key Research and Development Program of China (No. 2019YFB1706300).
Abstract: Personalized outfit recommendation has emerged as a hot research topic in the fashion domain. However, existing recommendation methods do not fully exploit user style preferences. Typically, users prefer particular styles, such as casual or athletic, and consider attributes like color and texture when selecting outfits. To achieve personalized outfit recommendations in line with user style preferences, this paper proposes a personal style guided outfit recommendation with multi-modal fashion compatibility modeling, termed PSGNet. First, a style classifier is designed to categorize fashion images of various clothing types and attributes into distinct style categories. Second, a personal style prediction module extracts user style preferences by analyzing historical data. Then, to address the limitations of single-modal representations and enhance fashion compatibility, both fashion images and text data are leveraged to extract multi-modal features. Finally, PSGNet integrates these components through Bayesian personalized ranking (BPR) to unify personal style and fashion compatibility, where the former serves as personal style features and guides the output of a personalized outfit recommendation tailored to the target user. Extensive experiments on large-scale datasets demonstrate that the proposed model is effective for personalized outfit recommendation.
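The BPR objective that PSGNet reportedly uses to unify style and compatibility is standard and easy to state: for each user, an observed (positive) outfit should score higher than a sampled negative one. A minimal sketch, with the scoring model itself left abstract:

```python
# A minimal sketch of the Bayesian personalized ranking (BPR) objective.
# How pos/neg scores are produced (the scoring network) is model-specific.
import torch
import torch.nn.functional as F

def bpr_loss(pos_score: torch.Tensor, neg_score: torch.Tensor) -> torch.Tensor:
    # Encourage each observed outfit to outscore its sampled negative.
    return -F.logsigmoid(pos_score - neg_score).mean()
```

In training, `pos_score` and `neg_score` would come from the same scoring network evaluated on a user's positive outfit and a randomly sampled negative outfit, so the loss shapes a ranking rather than an absolute rating.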
Abstract: This article describes a pilot study aimed at generating social interactions between a humanoid robot and adolescents with autism spectrum disorder (ASD) through the practice of a gesture imitation game. The participants were a 17-year-old young lady with ASD and intellectual deficit, and a control participant: a preadolescent with ASD but no intellectual deficit (Asperger syndrome). The game comprises four phases: greetings, pairing, imitation, and closing. Field educators were involved, playing specific roles: visual or physical inciter. The use of a robot allows for catching the participants' attention, playing the imitation game for a longer period of time than with a human partner, and avoiding the negative facial expressions a game partner may show from tiredness, impatience, or boredom. The participants' behavior was observed in terms of initial approach towards the robot, positioning relative to the robot in distance and orientation, reactions to the robot's voice or moves, signs of happiness, and imitation attempts. Results suggest an increasingly natural approach towards the robot over the sessions, as well as a rising level of social interaction, based on variations of the parameters listed above. We use these preliminary results to outline the next steps of our research work and identify further perspectives, with this aim in mind: improving social interactions with adolescents with ASD and intellectual deficit, allowing for better integration of these people into our societies.
Funding: Supported by a research fund from Chosun University, 2024.
Abstract: The rapid evolution of virtual reality (VR) and augmented reality (AR) technologies has significantly transformed human-computer interaction, with applications spanning entertainment, education, healthcare, industry, and remote collaboration. A central challenge in these immersive systems lies in enabling intuitive, efficient, and natural interactions. Hand gesture recognition offers a compelling solution by leveraging the expressiveness of human hands to facilitate seamless control without relying on traditional input devices such as controllers or keyboards, which can limit immersion. However, achieving robust gesture recognition requires overcoming challenges related to accurate hand tracking, complex environmental conditions, and minimizing system latency. This study proposes an artificial intelligence (AI)-driven framework for recognizing both static and dynamic hand gestures in VR and AR environments using skeleton-based tracking compliant with the OpenXR standard. Our approach employs a lightweight neural network architecture capable of real-time classification within approximately 1.3 ms while maintaining an average accuracy of 95%. We also introduce a novel dataset generation method to support training robust models and demonstrate consistent classification of diverse gestures across widely used commercial VR devices. This work represents one of the first studies to implement and validate dynamic hand gesture recognition in real time using standardized VR hardware, laying the groundwork for more immersive, accessible, and user-friendly interaction systems. By advancing AI-driven gesture interfaces, this research has the potential to broaden the adoption of VR and AR across diverse domains and enhance the overall user experience.
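For a sense of why roughly 1.3 ms classification is plausible, consider a minimal classifier over OpenXR-style hand-skeleton sequences (OpenXR's hand-tracking extension exposes 26 joints per hand). The layer sizes and gesture count below are illustrative assumptions, not the paper's architecture:

```python
# A minimal sketch of a lightweight classifier over hand-skeleton sequences.
# 26 joints x 3 coordinates follows OpenXR hand tracking; sizes are assumed.
import torch
import torch.nn as nn

class SkeletonGestureNet(nn.Module):
    def __init__(self, joints: int = 26, coords: int = 3, classes: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(joints * coords, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time: variable-length input
            nn.Flatten(),
            nn.Linear(64, classes),
        )

    def forward(self, seq):
        # seq: (B, T, joints*coords) -> (B, channels, T) for Conv1d
        return self.net(seq.transpose(1, 2))

logits = SkeletonGestureNet()(torch.randn(1, 45, 78))  # 45-frame gesture clip
```

With only a few thousand parameters, a network of this size can comfortably run in about a millisecond on commodity hardware, which is the regime the abstract reports; static gestures reduce to the T=1 case.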
Funding: Funded by Research Project, grant number BHQ090003000X03.
Abstract: Multi-modal Named Entity Recognition (MNER) aims to better identify meaningful textual entities by integrating information from images. Previous work has focused on extracting visual semantics at a fine-grained level, or on obtaining entity-related external knowledge from knowledge bases or Large Language Models (LLMs). However, these approaches ignore the poor semantic correlation between the visual and textual modalities in MNER datasets and do not explore different multi-modal fusion approaches. In this paper, we present MMAVK, a multi-modal named entity recognition model with auxiliary visual knowledge and word-level fusion, which leverages a Multi-modal Large Language Model (MLLM) as an implicit knowledge base and extracts vision-based auxiliary knowledge from the image for more accurate and effective recognition. Specifically, we propose vision-based auxiliary knowledge generation, which guides the MLLM through target-specific prompts to extract external knowledge exclusively derived from images to aid entity recognition, thus avoiding the redundant recognition and cognitive confusion caused by processing image-text pairs simultaneously. Furthermore, we employ a word-level multi-modal fusion mechanism to fuse the extracted external knowledge with each word embedding produced by the transformer-based encoder. Extensive experimental results demonstrate that MMAVK outperforms or matches state-of-the-art methods on two classical MNER datasets, even when the large models employed have significantly fewer parameters than other baselines.
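A word-level fusion mechanism of the kind described can be sketched as a learned gate that blends each token embedding with a knowledge vector distilled from the image. The gating formulation below is an assumption for illustration; MMAVK's exact fusion may differ:

```python
# A minimal sketch of word-level multi-modal fusion: each token embedding is
# gated against a single image-derived knowledge vector. Gating is assumed.
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, tokens, knowledge):
        # tokens: (B, L, D) from the text encoder; knowledge: (B, D) from MLLM
        k = knowledge.unsqueeze(1).expand_as(tokens)
        g = torch.sigmoid(self.gate(torch.cat([tokens, k], dim=-1)))
        return g * tokens + (1 - g) * k   # per-word blend of text and knowledge
```

Because the gate is computed per word, tokens that name entities can absorb more visual knowledge while function words pass through nearly unchanged, which is the point of fusing at the word level rather than the sentence level.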
Funding: Funded by Research Project, grant number BHQ090003000X03.
Abstract: Multi-modal knowledge graph completion (MMKGC) aims to complete missing entities or relations in multi-modal knowledge graphs, thereby discovering previously unknown triples. Owing to the continuous growth of data and knowledge and the limitations of data sources, the visual knowledge within knowledge graphs is generally of low quality, and some entities lack the visual modality entirely. Nevertheless, previous MMKGC studies have primarily focused on facilitating modality interaction and fusion while neglecting the problems of low modality quality and missing modalities. As a result, mainstream MMKGC models only use pre-trained visual encoders to extract features and transfer the semantic information to the joint embeddings through modal fusion, which inevitably suffers from problems such as error propagation and increased uncertainty. To address these problems, we propose a Multi-modal knowledge graph Completion model based on Super-resolution and Detailed Description Generation (MMCSD). Specifically, we leverage a pre-trained residual network to enhance the resolution and improve the quality of the visual modality. Moreover, we design multi-level visual semantic extraction and entity description generation, thereby further extracting entity semantics from structural triples and visual images. Meanwhile, we train a variational multi-modal auto-encoder and utilize a pre-trained multi-modal language model to complement the missing visual features. We conducted experiments on FB15K-237 and DB13K, and the results show that MMCSD performs MMKGC effectively and achieves state-of-the-art performance.
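One way to read "complement the missing visual features" with a variational auto-encoder is to sample a stand-in visual embedding from a distribution conditioned on the entity's structural embedding. The sketch below shows only that reparameterization idea; the conditioning scheme and dimensions are assumptions, not MMCSD's published design:

```python
# A minimal sketch of imputing a missing visual embedding from a VAE-style
# conditional distribution over the entity's structural embedding. Assumed.
import torch
import torch.nn as nn

class VisualImputer(nn.Module):
    def __init__(self, ent_dim: int = 200, vis_dim: int = 512, z_dim: int = 64):
        super().__init__()
        self.to_stats = nn.Linear(ent_dim, 2 * z_dim)   # mean and log-variance
        self.decode = nn.Linear(z_dim, vis_dim)

    def forward(self, ent_emb):
        mu, logvar = self.to_stats(ent_emb).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decode(z)   # stand-in feature for the missing image
```

Training such a module jointly with real visual features lets entities without images receive embeddings drawn from the same learned space, rather than zeros or averages that would distort the fusion step.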
Funding: Supported by the Deanship of Research and Graduate Studies at King Khalid University under Small Research Project grant number RGP1/139/45.
Abstract: Integrating multiple medical imaging techniques, including Magnetic Resonance Imaging (MRI), Computed Tomography, Positron Emission Tomography (PET), and ultrasound, provides a comprehensive view of a patient's health status. Each of these methods contributes unique diagnostic insights, enhancing the overall assessment of the patient's condition. Nevertheless, amalgamating data from multiple modalities is difficult because of disparities in resolution, data collection methods, and noise levels. While traditional models like Convolutional Neural Networks (CNNs) excel in single-modality tasks, they struggle with multi-modal complexities because they lack the capacity to model global relationships. This research presents a novel approach for examining multi-modal medical imagery using a transformer-based system. The framework employs self-attention and cross-attention mechanisms to synchronize and integrate features across modalities. It is also resilient to variations in noise and image quality, making it suitable for real-time clinical use. To address the computational hurdles linked to transformer models, particularly in real-time clinical applications in resource-constrained environments, several optimization techniques have been integrated to boost scalability and efficiency. First, a streamlined transformer architecture was adopted to minimize the computational load while maintaining model effectiveness. Model pruning, quantization, and knowledge distillation were applied to reduce the parameter count and enhance inference speed. Furthermore, efficient attention mechanisms such as linear or sparse attention were employed to alleviate the substantial memory and processing requirements of conventional self-attention. For deployment, hardware-aware acceleration strategies, including TensorRT- and ONNX-based model compression, ensure efficient execution on edge devices. These optimizations allow the approach to function effectively in real-time clinical settings, even in environments with limited resources. Future research directions include integrating non-imaging data to facilitate personalized treatment and further enhancing computational efficiency for resource-limited deployments. This study highlights the transformative potential of transformer models in multi-modal medical imaging, offering improvements in diagnostic accuracy and patient care outcomes.
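As an example of the efficient attention the abstract mentions, linear attention replaces the softmax with a positive feature map so that keys and values can be aggregated before querying, reducing the cost from quadratic to linear in sequence length. The sketch follows the common "linear transformer" recipe (ELU+1 feature map), which is a stand-in, not necessarily the variant used in the study:

```python
# A minimal sketch of linear (kernelized) attention. The ELU+1 feature map
# is the standard linear-transformer choice, assumed here for illustration.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (B, H, T, d). Cost is O(T * d^2) instead of softmax's O(T^2 * d).
    q = F.elu(q) + 1                                    # positive feature map
    k = F.elu(k) + 1
    kv = torch.einsum("bhtd,bhte->bhde", k, v)          # aggregate keys/values
    z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)
```

Because the key-value summary `kv` has a fixed size independent of sequence length, memory no longer grows with the square of the token count, which is what makes such substitutes attractive for high-resolution medical volumes on edge hardware.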