Emotion recognition under uncontrolled and noisy environments presents persistent challenges in the design of emotionally responsive systems. The current study introduces an audio-visual recognition framework designed to address performance degradation caused by environmental interference, such as background noise, overlapping speech, and visual obstructions. The proposed framework employs a structured fusion approach, combining early-stage feature-level integration with decision-level coordination guided by temporal attention mechanisms. Audio data are transformed into mel-spectrogram representations, and visual data are represented as raw frame sequences. Spatial and temporal features are extracted through convolutional and transformer-based encoders, allowing the framework to capture complementary and hierarchical information from both sources. A cross-modal attention module enables selective emphasis on relevant signals while suppressing modality-specific noise. Performance is validated on a modified version of the AFEW dataset, in which controlled noise is introduced to emulate realistic conditions. The framework achieves higher classification accuracy than comparative baselines, confirming increased robustness under conditions of cross-modal disruption. This result demonstrates the suitability of the proposed method for deployment in practical emotion-aware technologies operating outside controlled environments. The study also contributes a systematic approach to fusion design and supports further exploration of resilient multimodal emotion analysis frameworks. The source code is publicly available at https://github.com/asmoon002/AVER (accessed on 18 August 2025).
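The cross-modal attention described above can be sketched as scaled dot-product attention in which audio frames act as queries over visual frames, so that visual frames inconsistent with the audio signal receive low weight. This is a minimal illustrative sketch in NumPy, not the released implementation (which is in the linked repository); the feature dimensions and function name are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats):
    """Audio frames (queries) attend over visual frames (keys/values),
    down-weighting visual frames that do not align with the audio stream."""
    d = audio_feats.shape[-1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)  # (T_audio, T_visual)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ visual_feats, weights              # fused: (T_audio, d)

# Toy example: 5 audio frames and 8 visual frames with 16-dim features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 16))
visual = rng.standard_normal((8, 16))
fused, weights = cross_modal_attention(audio, visual)
print(fused.shape)  # (5, 16)
```

In the paper's full pipeline this would operate on encoder outputs (mel-spectrogram features on the audio side, frame features on the visual side) rather than raw random vectors.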
In this paper, new risk measures are introduced, and the corresponding representation results are also given. These newly introduced risk measures are extensions of those introduced by Song and Yan (2009) and Karoui (2009).
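For context, representation results of this kind typically take the following generic form (a sketch of the standard dual representation for convex risk measures, not the specific theorems of this paper): a convex risk measure $\rho$ admits

```latex
\rho(X) \;=\; \sup_{Q \in \mathcal{Q}} \Big( \mathbb{E}_{Q}[-X] \;-\; \alpha(Q) \Big),
```

where $\mathcal{Q}$ is a set of probability measures and $\alpha$ is a penalty function. Coherent risk measures correspond to the special case where $\alpha$ takes only the values $0$ and $+\infty$, reducing the representation to $\rho(X) = \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q}[-X]$.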
Transformers, originally designed for natural language processing, have recently been explored for computer vision. Various Vision Transformers (ViTs) play an increasingly important role in image tasks such as computer vision, multimodal fusion, and multimedia analysis. However, to obtain promising performance, most existing ViTs rely on artificially filtered high-quality images, which carry an inherent noise risk; such well-constructed images are not always available in every situation. To this end, we propose a Robust ViT (RViT) that focuses on relevant and robust representation learning for image classification tasks. Specifically, we first develop a novel Denoising VTUnet module, in which we conceptualize the non-robust noise as uncertainty under variational conditions. Furthermore, we design a fusion transformer backbone with a tailored fusion attention mechanism to perform image classification effectively based on the extracted robust representations. To demonstrate the superiority of our model, comparative experiments are conducted on several popular datasets. Benefiting from the sequence regularity of the Transformer and the captured robust features, the proposed method outperforms compared Transformer-based models in visual tasks.
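Treating non-robust noise as variational uncertainty, as the Denoising VTUnet module does, is commonly realized with the reparameterization trick: a feature is encoded as a mean and log-variance, a latent is sampled from that distribution, and a KL term regularizes it toward a standard normal. The sketch below illustrates only that mechanism in NumPy; the function names and shapes are assumptions for the example, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, so the feature carries explicit uncertainty."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over feature dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Toy example: a 4-dim feature encoded as a standard normal has zero KL cost.
mu = np.zeros(4)
logvar = np.zeros(4)
z = reparameterize(mu, logvar)
print(z.shape)                            # (4,)
print(kl_to_standard_normal(mu, logvar))  # 0.0
```

During training, minimizing the KL term alongside the task loss discourages the encoder from storing non-robust, noise-driven detail in the latent representation.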
Face recognition remains a challenging research problem, with difficulties including misalignment, illumination changes, pose variations, occlusion, and expressions. Providing a single solution that handles all of these problems at once is a demanding task. We address these issues by introducing a face recognition model based on local tetra patterns and spatial pyramid matching. The input image is passed through an algorithm that extracts local features using spatial pyramid matching and max-pooling. Finally, the input image is recognized with a robust kernel representation method applied to the extracted features. Qualitative and quantitative analyses of the proposed method are carried out on benchmark image datasets. Experimental results show that the proposed method outperforms state-of-the-art methods in terms of standard performance evaluation metrics on the AR, ORL, LFW, and FERET face recognition datasets.
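Spatial pyramid matching with max-pooling, as used above, divides a local-feature map into progressively finer grids and pools each cell, concatenating the results into one fixed-length descriptor. The following is a minimal NumPy sketch under assumed inputs (a dense H×W×D map of local descriptors, e.g. local tetra pattern responses); the function name and pyramid levels are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

def spatial_pyramid_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool an (H, W, D) local-feature map over 1x1, 2x2, and 4x4 grids
    and concatenate the pooled cells into one fixed-length vector."""
    H, W, D = feature_map.shape
    pooled = []
    for n in levels:
        row_groups = np.array_split(np.arange(H), n)
        col_groups = np.array_split(np.arange(W), n)
        for rows in row_groups:
            for cols in col_groups:
                cell = feature_map[np.ix_(rows, cols)]   # (h, w, D) sub-block
                pooled.append(cell.max(axis=(0, 1)))     # per-channel max
    return np.concatenate(pooled)  # D * (1 + 4 + 16) values for default levels

# Toy example: an 8x8 map of 10-dim local descriptors.
fm = np.random.default_rng(2).standard_normal((8, 8, 10))
vec = spatial_pyramid_max_pool(fm)
print(vec.shape)  # (210,)
```

The resulting vector would then feed the robust kernel representation classifier; its length is fixed regardless of image size, which is what makes the pyramid pooling useful for matching.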
Funding: Supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT), grant number 2021-0-01341.
Funding: Supported in part by the National Natural Science Foundation of China (10971157), the Key Projects of Philosophy and Social Sciences Research, Ministry of Education of China (09JZD0027), and the Talent Introduction Projects of Nanjing Audit University.
Funding: This project was funded by the Deanship of Scientific Research (DSR) at King Abdul Aziz University, Jeddah, under Grant No. KEP-10-611-42. The authors therefore acknowledge with thanks the DSR's technical and financial support.