Funding: Supported in part by the National Natural Science Foundation of China (61773330).
Abstract: Audio-visual speech recognition (AVSR), which integrates audio and visual modalities to improve recognition performance and robustness in noisy or adverse acoustic conditions, has attracted significant research interest. However, Conformer-based architectures remain computationally expensive because the space and time complexity of their softmax-based attention mechanisms grows quadratically with sequence length. In addition, Conformer-based architectures may not provide sufficient flexibility for modeling local dependencies at different granularities. To mitigate these limitations, this study introduces a novel AVSR framework based on a ReLU-based Sparse and Grouped Conformer (RSG-Conformer) architecture. Specifically, we propose a Global-enhanced Sparse Attention (GSA) module incorporating an efficient context restoration block to recover lost contextual cues. Concurrently, a Grouped-scale Convolution (GSC) module replaces the standard Conformer convolution module, providing adaptive local modeling across varying temporal resolutions. Furthermore, we integrate a Refined Intermediate Contextual CTC (RIC-CTC) supervision strategy. This approach applies progressively increasing loss weights combined with convolution-based context aggregation, thereby further relaxing the conditional independence constraint inherent in standard CTC frameworks. Evaluations on the LRS2 and LRS3 benchmarks validate the efficacy of our approach, with word error rates (WERs) reduced to 1.8% and 1.5%, respectively. These results demonstrate state-of-the-art performance on AVSR tasks.
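The abstract does not spell out how the ReLU-based sparse attention is computed, so the following is only a minimal sketch of the general idea, assuming that softmax over scaled dot-product scores is replaced by a ReLU followed by sum normalization. The function names and the toy query and keys are hypothetical and are not taken from the paper.

```python
import math

def attention_scores(query, keys):
    """Scaled dot-product scores between one query and each key."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

def softmax_weights(scores):
    """Standard softmax: every key receives a strictly positive weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores, eps=1e-9):
    """Assumed ReLU variant: non-positive scores get exactly zero weight."""
    clipped = [max(0.0, s) for s in scores]
    total = sum(clipped)
    if total <= eps:
        return [0.0] * len(scores)
    return [c / total for c in clipped]

# Toy data (hypothetical): one query and four keys in two dimensions.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [-1.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

scores = attention_scores(query, keys)
dense = softmax_weights(scores)
sparse = relu_weights(scores)
```

In this toy case the softmax weights are all positive, while the ReLU weights are exactly zero for the second and third keys, which is the kind of sparsity such a mechanism can exploit.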
Funding: Funded by the Ministry of Higher Education Malaysia, Fundamental Research Grant Scheme (FRGS), FRGS/1/2024/ICT07/UPNM/02/1.
Abstract: The rapid progression of Internet of Things (IoT) technology enables its application across various sectors. However, IoT devices typically have limited computing power and user interfaces, making them susceptible to security threats. One significant risk to cloud networks is Distributed Denial-of-Service (DoS) attacks, in which attackers aim to overwhelm a target system with excessive data and requests. Among these, low-rate DoS (LR-DoS) attacks are particularly hard to detect. By sending bursts of attack traffic at irregular intervals, LR-DoS significantly degrades the targeted system's Quality of Service (QoS). The low-rate nature of these attacks complicates their detection, as they frequently trigger congestion control mechanisms, leading to significant instability in IoT systems. Therefore, an innovative deep-learning model is developed in this work to detect LR-DoS attacks. The required data are drawn from standard datasets. Deep features are then extracted using the Residual Autoencoder with Sparse Attention (ResAE-SA), which derives the significant features required for detection. Finally, the Adaptive Dense Recurrent Neural Network (ADRNN) is used to detect LR-DoS attacks. To enhance detection, the parameters of the ADRNN are optimized using the Renovated Random Attribute-based Fennec Fox Optimization (RRA-FFA). The proposed optimization reduces the False Discovery Rate and False Positive Rate and raises the Matthews Correlation Coefficient to 95.8 on Dataset 1 and 91.7 on Dataset 2, compared with 23, 70.8, 76.2, and 84.28 on Dataset 1 and 70.28, 73.8, 74.1, and 82.6 on Dataset 2 for EPC-ADRNN, DPO-ADRNN, GTO-ADRNN, and FFA-ADRNN, respectively. At batch size 4, the accuracy of the RRA-FFA-ADRNN model exceeds that of GTO-ADRNN by 9.2%, EFC-ADRNN by 11.6%, DPO-ADRNN by 10.9%, and FFA-ADRNN by 4% on Dataset 1. On Dataset 2, its accuracy is 12.9%, 9.09%, 11.6%, and 10.9% higher than that of FFCNN, SVM, RNN, and DRNN, respectively. Overall, the proposed RRA-FFA-ADRNN model reaches an accuracy of 95.7% on Dataset 1 and 94.1% on Dataset 2, outperforming the existing baseline models.
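For readers unfamiliar with the three metrics named in this abstract, the sketch below computes the False Discovery Rate, False Positive Rate, and Matthews Correlation Coefficient from a binary confusion matrix using their standard definitions. The counts are invented for illustration and are not results from the paper, and `detection_metrics` is a hypothetical helper name.

```python
import math

def detection_metrics(tp, fp, fn, tn):
    """Return (FDR, FPR, MCC) for a binary attack/benign confusion matrix."""
    # False Discovery Rate: share of raised alarms that were false.
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    # False Positive Rate: share of benign traffic flagged as an attack.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Matthews Correlation Coefficient, in the range [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return fdr, fpr, mcc

# Illustrative counts only (assumed, not from the paper).
fdr, fpr, mcc = detection_metrics(tp=90, fp=10, fn=5, tn=95)
perfect = detection_metrics(tp=50, fp=0, fn=0, tn=50)
```

The abstract appears to report MCC on a 0 to 100 scale; multiplying the value returned here by 100 gives the same scale.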
Funding: Funded by the Xinjiang Uygur Autonomous Region Natural Science Foundation General Program (Project Number: 2023D01C18) and the second batch of the Tianchi Talents (Leading Talents) project in Xinjiang Uygur Autonomous Region. Project leader: Lei Liu, School of Computer Science and Technology, Xinjiang University.
Abstract: Existing uniform linear array (ULA) multitarget estimation algorithms adapt poorly to low signal-to-noise ratios (SNRs), and current deep learning methods have difficulty extracting complex-valued features from data effectively. To address these issues, a cross-scale sparse attention module and a channel-hierarchical spatial pyramid attention module, both based on the MSPANet block, are introduced into the deep neural network (DNN). This approach better extracts multiscale features of signal components, facilitating accurate signal feature extraction under low-SNR conditions. Experimental data demonstrate that this deep learning model can significantly enhance the accuracy and anti-jamming capability of direction-of-arrival (DOA) estimation in low-SNR scenarios, outperforming traditional methods such as CBF, MUSIC, and ESPRIT. These improvements have practical value for DOA estimation applications in fields such as intelligent speech, radar detection, communication systems, and autonomous driving.
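As background for the CBF baseline mentioned above, here is a minimal sketch of conventional (delay-and-sum) beamforming on a ULA. It assumes half-wavelength sensor spacing, a single noiseless snapshot, and one far-field narrowband source; the function names, array size, and source angle are hypothetical, and the sketch says nothing about the proposed network itself.

```python
import cmath
import math

def steering_vector(theta_deg, num_sensors, spacing=0.5):
    """ULA steering vector; spacing is given in wavelengths (assumed 0.5)."""
    phase = 2.0 * math.pi * spacing * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * m) for m in range(num_sensors)]

def cbf_spectrum(snapshot, angles_deg, spacing=0.5):
    """Beamformer output power |a(theta)^H x|^2 for each scan angle."""
    powers = []
    for theta in angles_deg:
        a = steering_vector(theta, len(snapshot), spacing)
        response = sum(am.conjugate() * xm for am, xm in zip(a, snapshot))
        powers.append(abs(response) ** 2)
    return powers

# Toy scenario (assumed): 8 sensors, one noiseless source at 20 degrees.
true_angle = 20
snapshot = steering_vector(true_angle, num_sensors=8)
angles = list(range(-90, 91))  # scan grid in 1-degree steps
spectrum = cbf_spectrum(snapshot, angles)
estimate = angles[spectrum.index(max(spectrum))]
```

With noise added or with closely spaced sources, the peak of this spectrum becomes broad and unreliable, which is the low-SNR regime the abstract targets.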
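Abstract: To meet the requirements of practical application scenarios for 3D human pose estimation, a lightweight 3D human pose estimation model, DS-Net (Dilated Sparse Attention Network), is proposed, based on a dilated-convolution ResNet module and sparse self-attention (Sparse Attention, SA). First, the monocular, one-stage regression network for multiple 3D people (Monocular, One-stage, Regression of Multiple 3D People, ROMP) is taken as the base pose estimation model, and the convolutions in the basic ResNet modules of its branches are replaced with dilated convolutions, reducing the number of model parameters without lowering accuracy. Second, Sparse Attention is embedded in the branches to strengthen contextual understanding and improve accuracy. Finally, the model is trained on seven datasets and tested on the 3DPW dataset to verify its feasibility. Experiments show that the proposed DS-Net reduces the total number of parameters by 53.8%; compared with ROMP on the 3D human pose estimation task, MPJPE and PA-MPJPE are reduced by 1.8% and 2.9%, respectively, meeting the requirements of practical pose estimation applications.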
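The abstract does not describe how the dilated convolutions are configured, so the following one-dimensional sketch only illustrates the general property they rely on: dilation widens the receptive field without adding kernel weights. `dilated_conv1d` is a hypothetical helper, and it computes cross-correlation, as deep learning frameworks do, with no padding.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D cross-correlation with a dilated kernel."""
    # Number of input samples one output value depends on.
    span = dilation * (len(kernel) - 1) + 1
    outputs = []
    for start in range(len(signal) - span + 1):
        taps = (w * signal[start + i * dilation] for i, w in enumerate(kernel))
        outputs.append(sum(taps))
    return outputs

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # three weights in both cases

standard = dilated_conv1d(signal, kernel, dilation=1)  # receptive field 3
dilated = dilated_conv1d(signal, kernel, dilation=2)   # receptive field 5
```

Both calls use the same three weights, but with `dilation=2` each output spans five input samples, which a standard convolution would need five weights to cover.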
Funding: Supported by the Science and Technology Project in Xi'an (22GXFW0123).
Abstract: Multimodal sentiment analysis, which integrates text, speech, and image modalities, has emerged as a prominent research direction in artificial intelligence for precise emotion assessment. However, current techniques have difficulty efficiently managing redundancy and inconsistency across features from different modalities, which compromises sentiment analysis accuracy. Additionally, while the analysis of intraclass emotional features has received substantial attention, interclass relationships have been neglected. To address these challenges, a multimodal sentiment analysis method based on contrastive learning and cross-modal guided fusion (CLCGF) is proposed. This method encodes text and images to derive latent representations and employs a cross-modal guided module with sparse attention mechanisms to integrate textual and visual features effectively, thereby mitigating redundancy within each modality's features. In addition to the sentiment classification task, a supervised contrastive learning task is incorporated to help the model learn effective emotion-related features from multimodal data. To assess the efficacy of the CLCGF method, experiments were conducted on three public datasets: MVSA-Single, MVSA-Multiple, and HFM. The experimental results indicate that CLCGF significantly improves sentiment analysis accuracy compared with traditional methods.
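The abstract does not give the form of its supervised contrastive task. As a rough illustration, the sketch below implements one widely used formulation of a supervised contrastive loss, in which samples sharing a label are pulled together and all others are pushed apart. This is an assumption rather than the paper's exact objective, and the embeddings, labels, and temperature are invented.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        # Log-sum-exp over all other samples, computed stably.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy embeddings (assumed): two tight clusters in two dimensions.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supcon_loss(embeddings, [0, 0, 1, 1])  # labels match clusters
loss_mixed = supcon_loss(embeddings, [0, 1, 0, 1])    # labels cut across clusters
```

The loss is small when same-label samples already sit close together and large when labels cut across the clusters, which is the signal that encourages class-aware features.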