Funding: Supported by the National Natural Science Foundation of China under Grant No. 12204062 and the Natural Science Foundation of Shandong Province under Grant No. ZR2022MF330.
Abstract: To enhance speech emotion recognition capability, this study constructs a speech emotion recognition model integrating the adaptive acoustic mixup (AAM) and improved coordinate and shuffle attention (ICASA) methods. The AAM method optimizes data augmentation by combining a sample selection strategy with dynamic interpolation coefficients, enabling information fusion of speech data with different emotions at the acoustic level. The ICASA method enhances feature extraction through dynamic fusion of the improved coordinate attention (ICA) and shuffle attention (SA) techniques. The ICA technique reduces computational overhead by employing depthwise separable convolution and an h-swish activation function, and it captures long-range dependencies of multi-scale time-frequency features through attention weights. The SA technique promotes feature interaction through channel shuffling, which helps the model learn richer and more discriminative emotional features. Experimental results demonstrate that, compared to the baseline model, the proposed model improves weighted accuracy by 5.42% and 4.54%, and unweighted accuracy by 3.37% and 3.85%, on the IEMOCAP and RAVDESS datasets, respectively. Independent-samples t-tests confirm these improvements to be statistically significant, supporting the practical reliability and applicability of the proposed model in real-world emotion-aware speech systems.
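The abstract does not specify the AAM implementation; the minimal sketch below illustrates only the general mixup idea at the acoustic level, with a dynamic interpolation coefficient drawn per sample pair from a Beta distribution. The function and parameter names (acoustic_mixup, alpha) are illustrative assumptions, not the authors' method.

```python
import numpy as np

def acoustic_mixup(spec_a, spec_b, label_a, label_b, num_classes, alpha=0.4):
    """Illustrative mixup on two log-mel spectrograms (freq x time).

    A dynamic interpolation coefficient lam is drawn per pair from a
    Beta(alpha, alpha) distribution, so each augmented sample blends the
    two emotions to a different degree.
    """
    lam = np.random.beta(alpha, alpha)               # dynamic coefficient in (0, 1)
    mixed_spec = lam * spec_a + (1.0 - lam) * spec_b

    # Soft label: one-hot targets blended with the same coefficient.
    y_a = np.eye(num_classes)[label_a]
    y_b = np.eye(num_classes)[label_b]
    mixed_label = lam * y_a + (1.0 - lam) * y_b
    return mixed_spec, mixed_label

# Usage: mix an "angry" sample with a "sad" sample at the acoustic level.
spec_angry = np.random.randn(64, 300)  # placeholder 64-mel, 300-frame spectrogram
spec_sad = np.random.randn(64, 300)
aug_spec, aug_label = acoustic_mixup(spec_angry, spec_sad,
                                     label_a=0, label_b=2, num_classes=4)
```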
Funding: Supported by the Natural Science Foundation of Fujian Province under Grant No. 2024J010016, the Fujian Province Young and Middle-aged Teacher Education Research Project (No. JAT241317), and the Mindu Innovation Laboratory Project under Grant No. 2020ZZ113.
Abstract: Lip language provides a silent, intuitive, and efficient mode of communication, offering a promising solution for individuals with speech impairments. Its articulation relies on complex movements of the jaw and the muscles surrounding it. However, accurately acquiring these movements in real time and decoding them into reliable silent speech signals remains a significant challenge. In this work, we propose a real-time silent speech recognition system that integrates a triboelectric nanogenerator-based flexible pressure sensor (FPS) with a deep learning framework. The FPS employs a porous, pyramid-structured silicone film as the negative triboelectric layer, enabling highly sensitive pressure detection in the low-force regime (1 V N^(-1) for 0-10 N and 4.6 V N^(-1) for 10-24 N). This allows it to precisely capture jaw movements during speech and convert them into electrical signals. To decode these signals, we propose a convolutional neural network-long short-term memory (CNN-LSTM) hybrid network, combining a CNN and an LSTM to extract both local spatial features and temporal dynamics. The model achieves 95.83% classification accuracy across 30 categories of daily words. Furthermore, the decoded silent speech signals can be translated directly into executable commands for contactless, precise control of a smartphone. The system can also be connected to AR glasses, offering a novel human-machine interaction approach with promising potential in AR/VR applications.
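As a rough sketch of the described CNN-LSTM hybrid, the PyTorch model below pairs a 1-D CNN front end (local spatial features of the sensor signal) with an LSTM over the resulting feature sequence (temporal dynamics). The layer sizes, signal length, and the class count of 30 are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Illustrative CNN-LSTM hybrid for 1-D triboelectric sensor traces."""
    def __init__(self, num_classes=30):
        super().__init__()
        # 1-D CNN: local spatial features of the pressure signal.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
        )
        # LSTM: temporal dynamics over the CNN feature sequence.
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):                 # x: (batch, 1, signal_length)
        feats = self.cnn(x)               # (batch, 64, reduced_length)
        feats = feats.transpose(1, 2)     # (batch, reduced_length, 64)
        _, (h_n, _) = self.lstm(feats)    # h_n: (1, batch, 128)
        return self.fc(h_n[-1])           # logits over e.g. 30 daily words

# Usage: a batch of 8 one-channel sensor traces, 1024 samples long.
model = CNNLSTMClassifier()
logits = model(torch.randn(8, 1, 1024))   # -> shape (8, 30)
```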
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 61773330.
Abstract: Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive because the spatial and temporal complexity of their softmax-based attention mechanisms grows quadratically with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy, which applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional-independence constraint inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively, demonstrating state-of-the-art performance in AVSR tasks.
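The abstract does not describe the internals of the GSA module; one common way a ReLU-based attention avoids the quadratic cost of softmax attention is the kernelized, linear-complexity formulation sketched below. This is purely an assumed illustration of that general technique, not the paper's design.

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """Illustrative ReLU-kernel linear attention (not the paper's GSA module).

    Applying ReLU to queries and keys gives sparse, non-negative feature
    maps; reassociating the matrix products avoids forming the T x T
    attention matrix, so the cost grows linearly with sequence length T.
    """
    q, k = F.relu(q), F.relu(k)                    # sparse, non-negative feature maps
    kv = torch.matmul(k.transpose(-2, -1), v)      # (..., d, d) key-value summary
    norm = torch.matmul(q, k.sum(dim=-2, keepdim=True).transpose(-2, -1)) + eps  # (..., T, 1)
    return torch.matmul(q, kv) / norm              # (..., T, d)

# Usage: batch of 2 audio-visual sequences, 50 frames, 64 dimensions per head.
q = k = v = torch.randn(2, 50, 64)
out = relu_linear_attention(q, k, v)   # -> torch.Size([2, 50, 64])
```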