Funding: Supported by the National Natural Science Foundation of China (62106214), the Hebei Natural Science Foundation (D2024203008), and the Provincial Key Laboratory Performance Subsidy Project (22567612H).
Abstract: In recent years, audio pattern recognition has emerged as a key area of research, driven by its applications in human-computer interaction, robotics, and healthcare. Traditional methods, which rely heavily on handcrafted features such as Mel filters, often suffer from information loss and limited feature representation capabilities. To address these limitations, this study proposes an end-to-end audio pattern recognition framework that directly processes raw audio signals, preserving the original information and extracting effective classification features. The proposed framework uses a dual-branch architecture: a global refinement module that retains channel and temporal details, and a multi-scale embedding module that captures high-level semantic information. Additionally, a guided fusion module integrates complementary features from both branches, ensuring a comprehensive representation of the audio data. Specifically, the multi-scale audio context embedding module is designed to effectively extract spatiotemporal dependencies, while the global refinement module aggregates multi-scale channel and temporal cues for enhanced modeling. The guided fusion module leverages these features to achieve efficient integration of complementary information, resulting in improved classification accuracy. Experimental results demonstrate the model’s superior performance on multiple datasets, including ESC-50, UrbanSound8K, RAVDESS, and CREMA-D, with classification accuracies of 93.25%, 90.91%, 92.36%, and 70.50%, respectively. These results highlight the robustness and effectiveness of the proposed framework, which significantly outperforms existing approaches. By addressing critical challenges such as information loss and limited feature representation, this work provides new insights and methodologies for advancing audio classification and multimodal interaction systems.
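To make the dual-branch idea concrete, the following is a minimal PyTorch sketch of such an architecture over raw waveforms: one branch preserving fine temporal and channel detail, one capturing multi-scale context through parallel dilated convolutions, and a sigmoid gate fusing the two. All layer sizes, the gating design, and the class and module names are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn

class DualBranchAudioNet(nn.Module):
    """Illustrative dual-branch classifier over raw waveforms (assumed design)."""
    def __init__(self, n_classes=50):  # 50 classes, e.g. ESC-50
        super().__init__()
        # Branch 1: retains fine channel/temporal detail.
        self.detail = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=4, padding=4),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        # Branch 2: multi-scale context via parallel dilated convolutions.
        self.context = nn.ModuleList([
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4 * d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        # Guided fusion (assumed form): a channel gate computed from both branches.
        self.gate = nn.Sequential(nn.Conv1d(256, 256, kernel_size=1), nn.Sigmoid())
        self.head = nn.Linear(256, n_classes)

    def forward(self, wav):                                   # wav: (batch, 1, samples)
        a = self.detail(wav)                                  # (B, 128, T)
        b = torch.cat([m(wav) for m in self.context], dim=1)  # (B, 128, T)
        f = torch.cat([a, b], dim=1)                          # (B, 256, T)
        f = f * self.gate(f)                                  # gated fusion of the branches
        return self.head(f.mean(dim=-1))                      # temporal pooling + classifier

logits = DualBranchAudioNet()(torch.randn(2, 1, 16000))      # two 1-second clips at 16 kHz

The gate lets each branch's channels be amplified or suppressed depending on the combined evidence, which is one simple way to read "guided fusion of complementary features".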
Funding: Supported by the Beijing Natural Science Foundation (5252014) and the National Natural Science Foundation of China (62303063).
Abstract: Passive acoustic monitoring (PAM) technology is increasingly becoming one of the mainstream methods for bird monitoring. However, detecting bird audio within complex natural acoustic environments using PAM devices remains a significant challenge. To enhance the accuracy (ACC) of bird audio detection (BAD) and reduce both false negatives and false positives, this study proposes a BAD method based on a Dual-Feature Enhancement Fusion Model (DFEFM). The method incorporates per-channel energy normalization (PCEN) to suppress noise in the input audio and uses mel-frequency cepstral coefficients (MFCC) and frequency correlation matrices (FCM) as input features. It achieves deep feature-level fusion of MFCC and FCM along the channel dimension through two independent multi-layer convolutional network branches, and further integrates Spatial and Channel Synergistic Attention (SCSA) and Multi-Head Attention (MHA) modules to enhance the fusion of these two deep features. Experimental results on the DCASE2018 BAD dataset show that the proposed method achieved an ACC of 91.4% and an AUC of 0.963, with false negative and false positive rates of 11.36% and 7.40%, respectively, surpassing existing methods. The method also demonstrated detection ACC above 92% and AUC values above 0.987 on datasets from three sites in different natural scenes in Beijing. Testing on the NVIDIA Jetson Nano showed that the method achieved an ACC of 89.48% when processing an average of 10 s of audio, with a response time of only 0.557 s, demonstrating excellent processing efficiency. This study provides an effective method for filtering out non-bird audio in bird vocalization monitoring devices, which helps save edge storage and information transmission costs, and has significant application value for wild bird monitoring and ecological research.
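As an illustration of the front end described above, the sketch below computes the two input features with librosa. The abstract does not define the FCM construction, so it is read here, as one plausible interpretation, as the Pearson correlation between mel-band energy trajectories; the sample rate, band count, and function name are assumptions.

import numpy as np
import librosa

def bad_features(path, sr=32000, n_mels=64, n_mfcc=40):
    """Compute MFCC and an assumed FCM for one audio clip.

    PCEN suppresses slowly varying background noise; the FCM is taken
    here as the correlation between mel-band energy trajectories.
    """
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, power=1.0)
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr)    # per-channel energy normalization
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel ** 2), n_mfcc=n_mfcc)
    fcm = np.corrcoef(pcen)                        # (n_mels, n_mels) band-correlation matrix
    return mfcc, fcm                               # two inputs for the two CNN branches

A time-frequency MFCC map and a square band-correlation matrix have complementary shapes, which is consistent with feeding them to two independent convolutional branches before channel-wise fusion.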
Funding: Funded by the National Science and Technology Council of Taiwan under grant number NSTC 113-2221-E-035-058.
Abstract: With the rapid expansion of multimedia data, protecting digital information has become increasingly critical. Reversible data hiding offers an effective solution by allowing sensitive information to be embedded in multimedia files while enabling full recovery of the original data after extraction. Audio, as a vital medium in communication, entertainment, and information sharing, demands the same level of security as images. However, embedding data in encrypted audio poses unique challenges due to the trade-offs between security, data integrity, and embedding capacity. This paper presents a novel interpolation-based reversible data hiding algorithm for encrypted audio that achieves scalable embedding capacity. By increasing sample density through interpolation, embedding opportunities are significantly enhanced while encryption is maintained throughout the process. The method further integrates multiple most significant bit (multi-MSB) prediction and Huffman coding to optimize compression and embedding efficiency. Experimental results on standard audio datasets demonstrate that the proposed algorithm can embed up to 12.47 bits per sample, with over 9.26 bits per sample available as pure embedding capacity, while preserving full reversibility. These results confirm the method’s suitability for secure applications that demand high embedding capacity and perfect reconstruction of the original audio. This work advances reversible data hiding in encrypted audio by offering a secure, efficient, and fully reversible data hiding framework.
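The capacity gain from interpolation can be illustrated with a toy, unencrypted sketch: a predictable sample is interpolated between each original pair, and k of its low bits carry payload, so extraction can recompute the interpolation and restore the signal exactly. The paper's encryption layer, multi-MSB prediction, and Huffman coding are omitted; the function names and the parameter k are assumptions for illustration only.

import numpy as np

def embed(samples, bits, k=2):
    """Toy interpolation-based embedding: originals pass through untouched,
    each interpolated sample carries k payload bits in its low bits."""
    assert len(bits) >= k * (len(samples) - 1), "not enough payload bits"
    out, i = [], 0
    for a, b in zip(samples[:-1], samples[1:]):
        mid = (int(a) + int(b)) // 2                   # predictable interpolated sample
        payload = 0
        for _ in range(k):                             # pack k bits, MSB first
            payload = (payload << 1) | bits[i]; i += 1
        out += [int(a), (mid & ~((1 << k) - 1)) | payload]
    out.append(int(samples[-1]))
    return np.array(out, dtype=np.int64)

def extract(stego, k=2):
    """Recover the payload and the original samples losslessly."""
    orig = stego[::2]                                  # originals sit at even indices
    bits = [int(s >> (k - 1 - j)) & 1 for s in stego[1::2] for j in range(k)]
    return orig, bits

orig, payload = extract(embed(np.array([100, 104, 98]), [1, 0, 1, 1]))  # round-trip check

Because the interpolated samples are simply discarded at extraction, reversibility is exact regardless of k, and raising k scales capacity, which mirrors the "scalable embedding capacity" claim above.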
Funding: This study is a phased achievement of the “Research on Innovative Communication of Romance of the Three Kingdoms under Audio Empowerment” project (No. 23ZGL16), funded by the Zhuge Liang Research Center, a key research base of social sciences in Sichuan Province.
Abstract: Visual media have dominated sensory communication for decades, and the resulting “visual hegemony” has led to calls for an “auditory return” to achieve a holistic balance in cultural acceptance. Romance of the Three Kingdoms, a classic Chinese literary work, has received significant attention and promotion from leading audio platforms. However, the commercialization of digital audio publishing faces unprecedented challenges due to the mismatch between the dissemination of long-form content on digital audio platforms and the current trend toward short, fast information consumption. Drawing on the Business Model Canvas theory and taking Romance of the Three Kingdoms as the main case for analysis, this paper argues that the construction of a business model for the audio publishing of classical books should proceed from three aspects: user evaluation of digital audio platforms, the establishment of value propositions based on the “creative transformation and innovative development” principle, and the improvement of the audio publishing infrastructure, so as to ensure the healthy operation and development of digital audio platforms, improve their current state of development, and expand the boundaries of cultural heritage.