Funding: Supported by the National Natural Science Foundation of China (61802253).
Abstract: To address the problem that traditional keypoint detection methods are susceptible to complex backgrounds and local image similarity, which leads to inaccurate descriptor matching and bias in visual localization, keypoints and descriptors based on cross-modality fusion are proposed and applied to camera motion estimation. A convolutional neural network detects keypoint positions and generates the corresponding descriptors, and pyramid convolution is used to extract multi-scale features in the network. The problem of local image similarity is addressed by capturing local and global feature information and fusing the geometric position information of keypoints into the generated descriptors. In our experiments, repeatability improves by 3.7% and homography estimation by 1.6%. To demonstrate the practicability of the method, the visual odometry component of a simultaneous localization and mapping system is constructed, and our method achieves 35% higher positioning accuracy than the traditional method.
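As an illustration of how keypoint geometry can be fused into descriptors as this abstract describes, the following is a minimal PyTorch sketch; the module name, dimensions, and sampling interface are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): fusing keypoint coordinates into
# descriptors, assuming dense CNN features `feat` [B, C, H, W] and keypoint
# pixel coordinates `kpts` [B, N, 2] (x, y order) are already available.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryFusedDescriptor(nn.Module):
    def __init__(self, feat_dim=256, desc_dim=256):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.head = nn.Linear(2 * feat_dim, desc_dim)

    def forward(self, feat, kpts, image_size):
        # image_size is (W, H); normalize coordinates to [-1, 1] for grid_sample.
        grid = kpts / kpts.new_tensor(image_size) * 2 - 1            # [B, N, 2]
        sampled = F.grid_sample(feat, grid.unsqueeze(1), align_corners=True)
        sampled = sampled.squeeze(2).transpose(1, 2)                 # [B, N, C]
        pos = self.pos_mlp(grid)                                     # encode keypoint geometry
        desc = self.head(torch.cat([sampled, pos], dim=-1))          # fuse appearance + position
        return F.normalize(desc, dim=-1)                             # unit-norm descriptors
```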
Abstract: Person re-identification (ReID) is a sub-problem of image retrieval: a technology that uses computer vision to identify a specific pedestrian in a collection of images or videos. The pedestrian images to be matched are captured across devices from surveillance footage. At present, most ReID methods handle matching between visible-light images, but as security monitoring systems continue to improve, more and more infrared cameras are used for surveillance at night or in dim light. Because of the imaging differences between infrared and RGB cameras, there is a large visual gap between cross-modality images, so traditional ReID methods are difficult to apply in this setting. In view of this, studying pedestrian matching between the visible and infrared modalities is particularly important. Visible-infrared person re-identification (VI-ReID) was first proposed in 2017 and has since attracted increasing attention, with many advanced methods emerging.
Funding: Support from the MSCA-ITN-ETN project ActiveMatter sponsored by the European Commission (Horizon 2020, Project No. 812780); support from the ERC-CoG project MAPEI sponsored by the European Commission (Horizon 2020, Project No. 101001267); support from the Knut and Alice Wallenberg Foundation (Grant No. 2019.0079); Caroline Beck Adiels and Giovanni Volpe acknowledge the Swedish Foundation for Strategic Research (Grant No. ITM17-0384).
Abstract: Recent advancements in deep learning (DL) have propelled the virtual transformation of microscopy images across optical modalities, enabling unprecedented multimodal imaging analysis hitherto impossible. Despite these strides, the integration of such algorithms into scientists' daily routines and clinical trials remains limited, largely due to a lack of recognition within their respective fields and the plethora of available transformation methods. To address this, we present a structured overview of cross-modality transformations, encompassing applications, data sets, and implementations, aimed at unifying this evolving field. Our review focuses on DL solutions for two key applications: contrast enhancement of targeted features within images and resolution enhancement. We recognize cross-modality transformations as a valuable resource for biologists seeking a deeper understanding of the field, as well as for technology developers aiming to better grasp sample limitations and potential applications. Notably, they enable high-contrast, high-specificity imaging akin to fluorescence microscopy without the need for laborious, costly, and disruptive physical-staining procedures. In addition, they facilitate imaging with properties that would typically require costly or complex physical modifications, such as super-resolution capabilities. By consolidating the current state of research in this review, we aim to catalyze further investigation and development, ultimately bringing the potential of cross-modality transformations into the hands of researchers and clinicians alike.
Funding: Funded by the Scientific Funding for China Academy of Railway Sciences Corporation Limited, China (No. 2023YJ125).
Abstract: Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating the identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed on a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model's ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
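A hedged sketch of what a "simplex center" style loss can look like, with identity centers at the vertices of a regular simplex on the unit hypersphere; the construction and loss form below are assumptions, not the authors' released code.

```python
# Illustrative sketch of a simplex-center style loss (assumption: embedding
# dimension >= number of identities; this is not the paper's implementation).
import torch
import torch.nn.functional as F

def simplex_vertices(num_classes: int, dim: int) -> torch.Tensor:
    """Vertices of a regular simplex lying on the unit hypersphere in R^dim."""
    eye = torch.eye(num_classes)
    centered = eye - eye.mean(dim=0, keepdim=True)        # centre the simplex at the origin
    vertices = F.normalize(centered, dim=1)               # num_classes equidistant unit vectors
    pad = torch.zeros(num_classes, dim - num_classes)     # lift into the embedding dimension
    return torch.cat([vertices, pad], dim=1)

def simplex_center_loss(face_emb, voice_emb, labels, centers):
    """Pull both modalities' L2-normalized embeddings toward their identity vertex."""
    target = centers[labels]                               # [B, dim]
    face_emb = F.normalize(face_emb, dim=1)
    voice_emb = F.normalize(voice_emb, dim=1)
    return ((face_emb - target) ** 2).sum(1).mean() + ((voice_emb - target) ** 2).sum(1).mean()
```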
Funding: Funded by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT), grant number 2021-0-01341.
Abstract: Emotion recognition under uncontrolled and noisy environments presents persistent challenges in the design of emotionally responsive systems. The current study introduces an audio-visual recognition framework designed to address performance degradation caused by environmental interference, such as background noise, overlapping speech, and visual obstructions. The proposed framework employs a structured fusion approach, combining early-stage feature-level integration with decision-level coordination guided by temporal attention mechanisms. Audio data are transformed into mel-spectrogram representations, and visual data are represented as raw frame sequences. Spatial and temporal features are extracted through convolutional and transformer-based encoders, allowing the framework to capture complementary and hierarchical information from both sources. A cross-modal attention module enables selective emphasis on relevant signals while suppressing modality-specific noise. Performance is validated on a modified version of the AFEW dataset, in which controlled noise is introduced to emulate realistic conditions. The framework achieves higher classification accuracy than comparative baselines, confirming increased robustness under conditions of cross-modal disruption. This result demonstrates the suitability of the proposed method for deployment in practical emotion-aware technologies operating outside controlled environments. The study also contributes a systematic approach to fusion design and supports further exploration of resilient multimodal emotion analysis frameworks. The source code is publicly available at https://github.com/asmoon002/AVER (accessed on 18 August 2025).
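For readers unfamiliar with cross-modal attention fusion, the PyTorch sketch below shows one symmetric audio-visual formulation; the layer sizes, residual design, and pooling are assumptions rather than the AVER repository's actual code.

```python
# Hedged sketch of a cross-modal attention fusion block for audio/visual tokens.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_attends_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_attends_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: [B, Ta, D] mel-spectrogram tokens, video: [B, Tv, D] frame tokens
        a_ctx, _ = self.audio_attends_video(query=audio, key=video, value=video)
        v_ctx, _ = self.video_attends_audio(query=video, key=audio, value=audio)
        audio = self.norm_a(audio + a_ctx)                          # residual + norm
        video = self.norm_v(video + v_ctx)
        fused = torch.cat([audio.mean(1), video.mean(1)], dim=-1)   # pooled joint representation
        return fused                                                # [B, 2 * D]
```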
Funding: Funded by the National Natural Science Foundation of China (grant number 62306186), the Technology Plan Joint Foundation of Liaoning Province (grant number 2023-MSLH-246), and the Technology Plan Joint Foundation of Liaoning Province (grant number 2023-BSBA-238).
Abstract: Detecting surface defects on unused rails is crucial for evaluating rail quality and durability to ensure the safety of rail transportation. However, existing detection methods often struggle with challenges such as complex defect morphology, texture similarity, and fuzzy edges, leading to poor accuracy and missed detections. To resolve these problems, we propose MSCM-Net (Multi-Scale Cross-Modal Network), a multi-scale cross-modal framework focused on detecting rail surface defects. MSCM-Net introduces an attention mechanism to dynamically weight the fusion of RGB and depth maps, effectively capturing and enhancing features at different scales for each modality. To further enrich feature representation and improve edge detection in blurred areas, we propose a multi-scale void fusion module that integrates multi-scale feature information. To improve cross-modal feature fusion, we develop a cross-enhanced fusion module that transfers fused features between layers to incorporate inter-layer information. We also introduce a multimodal feature integration module, which merges modality-specific features from separate decoders into a shared decoder, enhancing detection by leveraging richer complementary information. Finally, we validate MSCM-Net on the NEU RSDDS-AUG RGB-depth dataset, comparing it against 12 leading methods; the results show that MSCM-Net achieves superior performance on all metrics.
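A minimal sketch of attention-weighted RGB/depth fusion of the kind described above; the channel-gating design and sizes are illustrative assumptions, not the MSCM-Net implementation.

```python
# Sketch: learn per-modality weights and fuse RGB and depth feature maps.
import torch
import torch.nn as nn

class RGBDAttentionFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),
            nn.Softmax(dim=1),                      # per-modality weights that sum to 1
        )

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat, depth_feat: [B, C, H, W] feature maps from the two encoders
        w = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))   # [B, 2, 1, 1]
        return w[:, :1] * rgb_feat + w[:, 1:] * depth_feat        # weighted fusion
```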
Funding: Supported by Communication University of China (HG23035) and partly supported by the Fundamental Research Funds for the Central Universities (CUC230A013).
Abstract: With the rapid growth of social media, the spread of fake news has become a growing problem, misleading the public and causing significant harm. As social media content is often composed of both images and text, multimodal approaches to fake news detection have gained significant attention. To address the shortcomings of previous multimodal fake news detection algorithms, such as insufficient feature extraction and insufficient use of the semantic relations between modalities, this paper proposes the MFFFND-Co (Multimodal Feature Fusion Fake News Detection with Co-Attention Block) model. First, the model deeply explores textual content, image content, and frequency-domain features. Then, it employs a co-attention mechanism for cross-modal fusion. Additionally, a semantic consistency detection module is designed to quantify semantic deviations, thereby enhancing fake news detection performance. Experimentally verified on two commonly used datasets, Twitter and Weibo, the model achieved F1 scores of 90.0% and 94.0%, respectively, significantly outperforming the pre-modified MFFFND (Multimodal Feature Fusion Fake News Detection with Attention Block) model and surpassing other baseline models. This improves the accuracy of detecting fake information in artificial intelligence detection and engineering software detection.
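The co-attention fusion step can be illustrated with a small affinity-matrix formulation; the module below is a generic sketch, not the MFFFND-Co code.

```python
# Generic co-attention sketch between text and image token features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.w_affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, text, image):
        # text: [B, Nt, D], image: [B, Ni, D]
        affinity = torch.bmm(self.w_affinity(text), image.transpose(1, 2))          # [B, Nt, Ni]
        text_ctx = torch.bmm(F.softmax(affinity, dim=2), image)                     # words attend over regions
        image_ctx = torch.bmm(F.softmax(affinity, dim=1).transpose(1, 2), text)     # regions attend over words
        return text + text_ctx, image + image_ctx                                   # residually fused features
```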
Abstract: With the explosive growth of false information on social media platforms, the automatic detection of multimodal false information has received increasing attention. Recent research has significantly contributed to multimodal information exchange and fusion, with many methods attempting to integrate unimodal features to generate multimodal news representations. However, they have yet to fully explore the hierarchical and complex semantic correlations between different modal contents, which severely limits their performance in detecting multimodal false information. This work proposes a two-stage detection framework for multimodal false information, called ASMFD, which uses image aesthetic similarity to segment and explore the consistency and inconsistency features of images and texts. Specifically, we first use the Contrastive Language-Image Pre-training (CLIP) model to learn the relationship between text and images through label awareness, and train an image aesthetic attribute scorer using an aesthetic attribute dataset. Then, we calculate the aesthetic similarity between the image and related images and use this similarity as a threshold to divide the multimodal correlation matrix into consistency and inconsistency matrices. Finally, a fusion module is designed to identify the essential features for detecting multimodal false information. In extensive experiments on four datasets, the performance of ASMFD is superior to state-of-the-art baseline methods.
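The threshold-based split of the correlation matrix can be pictured with a toy example; the masking scheme below is an assumption, not the ASMFD implementation.

```python
# Toy sketch: split a text-image correlation matrix by an aesthetic-similarity threshold.
import torch

def split_by_aesthetic_similarity(corr: torch.Tensor, aesthetic_sim: float):
    """corr: [Nt, Ni] correlation matrix; aesthetic_sim: scalar threshold."""
    consistent_mask = corr >= aesthetic_sim
    consistency = corr * consistent_mask          # entries treated as consistent evidence
    inconsistency = corr * (~consistent_mask)     # entries treated as inconsistent evidence
    return consistency, inconsistency

# Example: a 2x3 correlation matrix split at threshold 0.5.
corr = torch.tensor([[0.8, 0.2, 0.6], [0.4, 0.9, 0.1]])
cons, incons = split_by_aesthetic_similarity(corr, 0.5)
```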
Funding: Supported by the National Natural Science Foundation of China under Grant 61702462, the Henan Provincial Science and Technology Research Project under Grants 222102210010 and 222102210064, the Research and Practice Project of Higher Education Teaching Reform in Henan Province under Grants 2019SJGLX320 and 2019SJGLX020, the Undergraduate Universities Smart Teaching Special Research Project of Henan Province under Grant JiaoGao [2021] No. 489-29, and the Academic Degrees & Graduate Education Reform Project of Henan Province under Grant 2021SJGLX115Y.
Abstract: Multimodal sentiment analysis aims to understand people's emotions and opinions from diverse data. Concatenating or multiplying the various modalities is the traditional multimodal fusion method for sentiment analysis, but it does not exploit the correlation information between modalities. To solve this problem, this paper proposes a model based on a multi-head attention mechanism. First, the original data are preprocessed and the feature representation is converted into a sequence of word vectors, with positional encoding introduced to better capture the semantic and sequential information in the input sequence. Next, the encoded input sequence is fed into the transformer model for further processing and learning. At the transformer layer, a cross-modal attention consisting of a pair of multi-head attention modules is employed to reflect the correlation between modalities. Finally, the processed results are passed to a feedforward neural network, and the emotional output is obtained through a classification layer. Through this processing flow, the model can capture semantic information and contextual relationships and achieves good results on various natural language processing tasks. Our model was tested on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) and Multimodal EmotionLines Dataset (MELD) benchmarks, achieving an accuracy of 82.04% and an F1 score of 80.59% on the former dataset.
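A standard sinusoidal positional encoding of the kind mentioned above looks like this; the exact variant used in the paper is assumed.

```python
# Sinusoidal positional encoding added to the word-vector sequence before the transformer.
import math
import torch

def positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1).float()                              # [L, 1]
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)       # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)       # odd dimensions
    return pe

# Word-vector sequence [B, L, D] plus positional information:
x = torch.randn(8, 32, 256) + positional_encoding(32, 256)
```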
Abstract: The cross-modal person re-identification task aims to match visible and infrared images of the same individual. The main challenges in this field arise from significant modality differences between individuals and the lack of high-quality cross-modal correspondence methods. Existing approaches often attempt to establish modality correspondence by extracting shared features across different modalities. However, these methods tend to focus on local information extraction and fail to fully leverage the global identity information in the cross-modal features, resulting in limited correspondence accuracy and suboptimal matching performance. To address this issue, we propose a quadratic graph matching method designed to overcome the challenges posed by modality differences through precise cross-modal relationship alignment. This method transforms the cross-modal correspondence problem into a graph matching task and minimizes the matching cost using a center search mechanism. Building on this approach, we further design a block reasoning module to uncover latent relationships between person identities and optimize the modality correspondence results. The block strategy not only improves the efficiency of updating gallery images but also enhances matching accuracy while reducing computational load. Experimental results demonstrate that our proposed method outperforms state-of-the-art methods on the SYSU-MM01, RegDB, and RGBNT201 datasets, achieving excellent matching accuracy and robustness, thereby validating its effectiveness in cross-modal person re-identification.
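As a much-simplified stand-in for the correspondence step, cross-modal matching can be cast as a linear assignment over a pairwise cost matrix; the paper's quadratic graph matching and center-search mechanism are not reproduced here.

```python
# Simplified stand-in: minimum-cost assignment between visible and infrared identity features.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_galleries(visible_feats: np.ndarray, infrared_feats: np.ndarray):
    """visible_feats: [N, D], infrared_feats: [M, D]; returns matched index pairs."""
    v = visible_feats / np.linalg.norm(visible_feats, axis=1, keepdims=True)
    r = infrared_feats / np.linalg.norm(infrared_feats, axis=1, keepdims=True)
    cost = 1.0 - v @ r.T                          # cosine distance as matching cost
    rows, cols = linear_sum_assignment(cost)      # minimum-cost correspondence
    return list(zip(rows.tolist(), cols.tolist()))
```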
Funding: Funded by the Ongoing Research Funding Program (ORF-2025-102), King Saud University, Riyadh, Saudi Arabia; by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJQN202400813); and by the Graduate Research Innovation Project (Grant Nos. yjscxx2025-269-193 and CYS25618).
Abstract: Medical image analysis based on deep learning has become an important technical requirement in the field of smart healthcare. In view of the difficulty of jointly modeling local details and global features in multimodal ophthalmic image analysis, as well as the information redundancy in cross-modal data fusion, this paper proposes a multimodal fusion framework based on cross-modal collaboration and a weighted attention mechanism. For feature extraction, the framework collaboratively extracts local fine-grained features and global structural dependencies through a parallel dual-branch architecture, overcoming the limitation of traditional single-modality models that capture either local or global information but not both. For the fusion strategy, the framework designs a cross-modal dynamic fusion strategy that combines overlapping multi-head self-attention modules with a bidirectional feature alignment mechanism, addressing the bottlenecks of low feature-interaction efficiency and excessive attention-fusion computation in traditional parallel fusion. It further introduces a cross-domain local integration technique that enhances the representation of lesion areas through pixel-level feature recalibration and improves diagnostic robustness on complex cases. Experiments show that the framework exhibits excellent feature expression and generalization performance in cross-domain scenarios spanning ophthalmic medical images and natural images, providing a high-precision, low-redundancy fusion paradigm for multimodal medical image analysis and promoting the upgrade of intelligent diagnosis and treatment from single-modal static analysis to dynamic decision-making.
Funding: Supported by the National Key R&D Program of China (No. 2022ZD0118402).
Abstract: Remote sensing cross-modal image-text retrieval (RSCIR) can flexibly and subjectively retrieve remote sensing images from query text, and has recently received increasing attention from researchers. However, as the parameter counts of visual-language pre-training models grow, direct transfer learning consumes a substantial amount of computational and storage resources. Moreover, recently proposed parameter-efficient transfer learning methods mainly focus on reconstructing channel features, ignoring the spatial features that are vital for modeling key entity relationships. To address these issues, we design an efficient transfer learning framework for RSCIR based on spatial feature efficient reconstruction (SPER). A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships. The spatial adapter can spatially reconstruct the features in the backbone with few parameters while incorporating prior information from the channel dimension. We conduct quantitative and qualitative experiments on two commonly used RSCIR datasets. Compared with traditional methods, our approach achieves an improvement of 3%-11% in the sumR metric. Compared with methods that fine-tune all parameters, our proposed method trains less than 1% of the parameters while maintaining about 96% of the overall performance.
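A lightweight spatial adapter in the spirit of the description above might look like the following; the bottleneck ratio and depthwise convolution are illustrative assumptions, not the SPER module.

```python
# Sketch: a few-parameter adapter that mixes spatial context on top of a frozen backbone.
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    def __init__(self, channels=768, reduction=8):
        super().__init__()
        hidden = channels // reduction
        self.down = nn.Conv2d(channels, hidden, kernel_size=1)        # inject channel prior
        self.spatial = nn.Conv2d(hidden, hidden, kernel_size=3,
                                 padding=1, groups=hidden)            # depthwise spatial mixing
        self.up = nn.Conv2d(hidden, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        # x: [B, C, H, W] feature map from the frozen backbone; only adapter weights train
        return x + self.up(self.act(self.spatial(self.down(x))))
```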
Funding: The National Natural Science Foundation of China (No. 61872231), the National Key Research and Development Program of China (No. 2021YFC2801000), and the Major Research Plan of the National Social Science Foundation of China (No. 2000&ZD130).
Abstract: Speech emotion recognition, as an important component of human-computer interaction technology, has received increasing attention. Recent studies have treated emotion recognition from speech signals as a multimodal task, since the signal carries the semantic features of two different modalities, i.e., audio and text. However, existing methods often fail to effectively represent features and capture correlations. This paper presents a multi-level circulant cross-modal Transformer (MLCCT) for multimodal speech emotion recognition. The proposed model can be divided into three steps: feature extraction, interaction, and fusion. Self-supervised embedding models are introduced for feature extraction, which give a more powerful representation of the original data than spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs). In particular, MLCCT contains two types of feature interaction processes: a bidirectional Long Short-Term Memory (Bi-LSTM) with a circulant interaction mechanism is proposed for low-level features, while a two-stream residual cross-modal Transformer block is applied when high-level features are involved. Finally, we choose self-attention blocks for fusion and a fully connected layer to make predictions. To evaluate the performance of our proposed model, comprehensive experiments are conducted on three widely used benchmark datasets, IEMOCAP, MELD, and CMU-MOSEI. The competitive results verify the effectiveness of our approach.
Funding: This study was supported by the National Natural Science Foundation of China (Nos. 61703058, 81873701).
Abstract: Studies on the integration of cross-modal information with taste perception have mostly been limited to the uni-modal level. Cross-modal sensory interaction, the neural network for information processing, and its control have not been fully explored, and the mechanisms remain poorly understood. This mini review examines the impact of uni-modal and multi-modal information on taste perception from the perspective of cognitive status, such as emotion, expectation, and attention, and discusses the hypothesis that cognitive status is the key step through which the visual sense exerts influence on taste. This work may help researchers better understand the mechanisms of cross-modal information processing and further develop neurally-based artificial intelligence (AI) systems.
Funding: This work was partially supported by the Science and Technology Project of Chongqing Education Commission of China (KJZD-K202200513), the National Natural Science Foundation of China (61370205), the Chongqing Normal University Fund (22XLB003), and the Chongqing Education Science Planning Project (2021-GX-320).
Abstract: In recent years, the development of deep learning has further improved hash retrieval technology. Most existing hashing methods use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process image and text information, respectively. This subjects images and texts to local constraints, and inherent label matching cannot capture fine-grained information, often leading to suboptimal results. Driven by the development of the transformer model, we propose a framework called ViT2CMH, based mainly on the Vision Transformer, to handle deep cross-modal hashing tasks instead of relying on CNNs or RNNs. Specifically, we use a BERT network to extract text features and the Vision Transformer as the image network of the model. Finally, the features are transformed into hash codes for efficient and fast retrieval. We conduct extensive experiments on Microsoft COCO (MS-COCO) and Flickr30K, comparing with baselines from hashing methods and image-text matching methods, and show that our method achieves better performance.
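The final step, turning transformer features into hash codes, can be sketched as a small hashing head with a tanh relaxation during training and sign binarization at retrieval time; the dimensions and bit length are assumptions, and the features are taken as already extracted.

```python
# Sketch of a hashing head on top of pre-extracted BERT text / ViT image features.
import torch
import torch.nn as nn

class HashHead(nn.Module):
    def __init__(self, in_dim=768, bits=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, bits)

    def forward(self, feats):
        # Training: tanh gives a differentiable relaxation of the binary code.
        return torch.tanh(self.proj(feats))

    @torch.no_grad()
    def binarize(self, feats):
        # Retrieval: sign() yields the +/-1 hash code used for Hamming ranking.
        return torch.sign(self.proj(feats))

text_head, image_head = HashHead(), HashHead()
img_code = image_head.binarize(torch.randn(4, 768))   # [4, 64] binary codes
```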
Funding: Supported by the National Natural Science Foundation of China (Nos. 61305042, 61202098), Projects of the Center for Remote Sensing Mission Study of China National Space Administration (No. 2012A03A0939), and the Science and Technology Research Key Project of the Education Department of Henan Province of China (No. 13A520071).
Abstract: Cross-modal semantic mapping and cross-media retrieval are key problems for multimedia search engines. This study analyzes the hierarchy, functionality, and structure of the visual and auditory sensations of the cognitive system, and establishes a brain-like cross-modal semantic mapping framework based on cognitive computing of visual and auditory sensations. The mechanisms of visual-auditory multisensory integration, selective attention in the thalamo-cortical system, emotional control in the limbic system, and memory enhancement in the hippocampus are considered in the framework. The algorithms for cross-modal semantic mapping are then given. Experimental results show that the framework can be effectively applied to cross-modal semantic mapping, and it also carries significance for brain-like computing with non-von Neumann architectures.
Funding: Supported by National Institutes of Health Contracts P30-EY008098 and T32-EY017271-06 (Bethesda, MD); United States Department of Defense DM090217 (Arlington, VA); an Alcon Research Institute Young Investigator Grant (Fort Worth, TX); the Eye and Ear Foundation (Pittsburgh, PA); Research to Prevent Blindness (New York, NY); an Aging Institute Pilot Seed Grant, University of Pittsburgh (Pittsburgh, PA); and the Postdoctoral Fellowship Program in Ocular Tissue Engineering and Regenerative Ophthalmology, Louis J. Fox Center for Vision Restoration, University of Pittsburgh and UPMC (Pittsburgh, PA).
Abstract: Blindness provides an unparalleled opportunity to study plasticity of the nervous system in humans. Seminal work in this area examined the often dramatic modifications to the visual cortex that result when visual input is completely absent from birth or very early in life (Kupers and Ptito, 2014). More recent studies explored what happens to the visual pathways in the context of acquired blindness. This is particularly relevant as the majority of diseases that cause vision loss occur in the elderly.
Funding: This work was partially supported by the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), the Chongqing Normal University Fund (Grant No. 22XLB003), the Chongqing Education Science Planning Project (Grant No. 2021-GX-320), and the Humanities and Social Sciences Project of Chongqing Education Commission of China (Grant No. 22SKGH100).
Abstract: In recent years, cross-modal hash retrieval has become a popular research field because of its high efficiency and low storage cost. Cross-modal retrieval technology can be applied to search engines, cross-modal medical processing, and similar applications. The prevailing approach uses a multi-label matching paradigm to perform the retrieval task. However, such methods do not use the fine-grained information in multi-modal data, which may lead to suboptimal results. To prevent cross-modal matching from degenerating into label matching, this paper proposes an end-to-end fine-grained cross-modal hash retrieval method that focuses more closely on the fine-grained semantic information of multi-modal data. First, the method refines the image features and no longer uses multiple labels to represent text features, instead processing text with BERT. Second, the method uses the inference capability of the transformer encoder to generate global fine-grained features. Finally, to better judge the effect of the fine-grained model, this paper uses datasets from the image-text matching field instead of traditional label-matching datasets. We experiment on the Microsoft COCO (MS-COCO) and Flickr30K datasets and compare with previous classical methods. The experimental results show that this method obtains more advanced results in the cross-modal hash retrieval field.
Funding: The Special Research Fund for the China Postdoctoral Science Foundation (No. 2015M582832), the Major National Science and Technology Program (No. 2015ZX01040201), and the National Natural Science Foundation of China (No. 61371196).
Abstract: To solve the problem that existing cross-modal entity resolution methods easily ignore the high-level semantic correlations between cross-modal data, we propose a novel cross-modal entity resolution method for image and text that integrates a global and fine-grained joint attention mechanism. First, we map the cross-modal data to a common embedding space using a feature extraction network. Then, we integrate a global joint attention mechanism and a fine-grained joint attention mechanism, giving the model the ability to learn both the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data, which fully exploits the cross-modal semantic correlation and boosts the performance of cross-modal entity resolution. Moreover, experiments on the Flickr-30K and MS-COCO datasets show that the overall R@sum performance outperforms five state-of-the-art methods by 4.30% and 4.54%, respectively, which fully demonstrates the superiority of our proposed method.
Abstract: In the era of big data rich in We Media, single-mode retrieval systems can no longer meet people's demand for information retrieval. This paper proposes a new solution to the problem of feature extraction and unified mapping of different modes: a Cross-Modal Hashing retrieval algorithm based on a Deep Residual Network (CMHR-DRN). The model is constructed in two stages. The first stage extracts features from the different modal data: a Deep Residual Network (DRN) extracts the image features, a combination of TF-IDF with a fully connected network extracts the text features, and the resulting image and text features serve as the input to the second stage. In the second stage, the image and text features are mapped into hash functions by supervised learning, projecting them into a common binary Hamming space. During the mapping, the distance measurements in the original space and in the common feature space are kept as consistent as possible to improve the accuracy of cross-modal retrieval. In training the model, adaptive moment estimation (Adam) is used to compute an adaptive learning rate for each parameter, and stochastic gradient descent (SGD) is used to minimize the loss function. The whole training process is carried out on the Caffe deep learning framework. Experiments show that the proposed CMHR-DRN algorithm based on a Deep Residual Network achieves better retrieval performance and stronger advantages than the other cross-modal algorithms CMFH, CMDN, and CMSSH.
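The text branch described above (TF-IDF features mapped through a fully connected network toward binary codes) can be sketched as follows; layer sizes are assumptions, and PyTorch/scikit-learn stand in for the original Caffe implementation.

```python
# Sketch: TF-IDF text features fed through a small fully connected network to hash codes.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["a person rides a bike", "two dogs play on the grass"]   # toy corpus
tfidf = torch.tensor(TfidfVectorizer().fit_transform(docs).toarray(), dtype=torch.float32)

text_net = nn.Sequential(
    nn.Linear(tfidf.shape[1], 128), nn.ReLU(),
    nn.Linear(128, 48), nn.Tanh(),        # 48-bit relaxed hash code in (-1, 1)
)
codes = torch.sign(text_net(tfidf))       # binarized codes for Hamming-space retrieval
```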