Funding: Funded by the Directorate of Research and Community Service, Directorate General of Research and Development, Ministry of Higher Education, Science and Technology, in accordance with the Implementation Contract for the Operational Assistance Program for State Universities, Research Program Number: 109/C3/DT.05.00/PL/2025.
Abstract: Sudden wildfires cause significant global ecological damage. While satellite imagery has advanced early fire detection and mitigation, image-based systems face limitations including high false alarm rates, visual obstructions, and substantial computational demands, especially in complex forest terrain. To address these challenges, this study proposes a novel forest fire detection model based on audio classification and machine learning. We developed an audio-based pipeline using real-world environmental sound recordings. Sounds were converted into Mel-spectrograms and classified with a convolutional neural network (CNN), enabling the capture of distinctive fire acoustic signatures (e.g., crackling, roaring) that are minimally affected by visual or weather conditions. Internet of Things (IoT) sound sensors were crucial for generating complex environmental parameters to optimize feature extraction. The CNN achieved high performance in stratified 5-fold cross-validation (92.4% ± 1.6% accuracy, 91.2% ± 1.8% F1-score) and on the test set (94.93% accuracy, 93.04% F1-score), with 98.44% precision and 88.32% recall, demonstrating reliability across environmental conditions. These results indicate that the audio-based approach not only improves detection reliability but also markedly reduces computational overhead compared with traditional image-based methods. The findings suggest that acoustic sensing integrated with machine learning offers a powerful, low-cost, and efficient solution for real-time forest fire monitoring in complex, dynamic environments.
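To make the described pipeline concrete, below is a minimal sketch of a Mel-spectrogram front end feeding a small CNN binary classifier (fire vs. no fire). The layer sizes, sample rate, and the file name `forest_clip.wav` are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch: Mel-spectrogram extraction + a small CNN classifier, assuming
# librosa and PyTorch. Hyperparameters are illustrative, not the paper's settings.
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_spectrogram(path, sr=22050, n_mels=64):
    """Load an audio clip and convert it to a log-scaled Mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)        # shape: (n_mels, frames)

class FireSoundCNN(nn.Module):
    """Small CNN over a (1, n_mels, frames) spectrogram treated as an image."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Score one recording (untrained weights, shown only to illustrate tensor shapes).
spec = mel_spectrogram("forest_clip.wav")               # hypothetical file name
x = torch.tensor(spec).unsqueeze(0).unsqueeze(0).float()
print(FireSoundCNN()(x).softmax(dim=1))                 # P(no fire), P(fire)
```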
Funding: Supported by the Beijing Natural Science Foundation (5252014) and the National Natural Science Foundation of China (62303063).
Abstract: Passive acoustic monitoring (PAM) technology is increasingly becoming one of the mainstream methods for bird monitoring. However, detecting bird audio within complex natural acoustic environments using PAM devices remains a significant challenge. To enhance the accuracy (ACC) of bird audio detection (BAD) and reduce both false negatives and false positives, this study proposes a BAD method based on a Dual-Feature Enhancement Fusion Model (DFEFM). The method incorporates per-channel energy normalization (PCEN) to suppress noise in the input audio and uses mel-frequency cepstral coefficients (MFCC) and frequency correlation matrices (FCM) as input features. It achieves deep feature-level fusion of MFCC and FCM along the channel dimension through two independent multi-layer convolutional network branches, and further integrates Spatial and Channel Synergistic Attention (SCSA) and Multi-Head Attention (MHA) modules to enhance the fusion of these two deep features. Experimental results on the DCASE2018 BAD dataset show that the proposed method achieved an ACC of 91.4% and an AUC of 0.963, with false negative and false positive rates of 11.36% and 7.40%, respectively, surpassing existing methods. The method also demonstrated detection ACC above 92% and AUC values above 0.987 on datasets from three sites in Beijing representing different natural scenes. Testing on the NVIDIA Jetson Nano showed that the method achieved an ACC of 89.48% when processing an average of 10 s of audio, with a response time of only 0.557 s, demonstrating excellent processing efficiency. This study provides an effective method for filtering out non-bird audio in bird vocalization monitoring devices, which helps save edge storage and transmission costs and has significant application value for wild bird monitoring and ecological research.
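As a rough illustration of the input features named above, the sketch below computes PCEN-normalized Mel energies, MFCCs, and a frequency correlation matrix (here taken as the correlation between Mel-band trajectories). The sample rate, band counts, and the FCM definition are assumptions for illustration, not the DFEFM authors' exact configuration.

```python
# Hedged sketch of the three inputs: PCEN, MFCC, and an assumed FCM construction.
import librosa
import numpy as np

def bad_input_features(path, sr=32000, n_mels=64, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr)        # per-channel energy normalization
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    fcm = np.corrcoef(librosa.power_to_db(mel))        # (n_mels, n_mels) band correlations
    return pcen, mfcc, fcm
```

In a DFEFM-style model, the MFCC and FCM arrays would then be fed to two separate convolutional branches and fused along the channel dimension, as the abstract describes.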
Funding: Supported by the National Natural Science Foundation of China (62106214), the Hebei Natural Science Foundation (D2024203008), and the Provincial Key Laboratory Performance Subsidy Project (22567612H).
Abstract: In recent years, audio pattern recognition has emerged as a key area of research, driven by its applications in human-computer interaction, robotics, and healthcare. Traditional methods, which rely heavily on handcrafted features such as Mel filters, often suffer from information loss and limited feature representation capabilities. To address these limitations, this study proposes an innovative end-to-end audio pattern recognition framework that directly processes raw audio signals, preserving the original information and extracting effective classification features. The proposed framework uses a dual-branch architecture: a global refinement module that retains channel and temporal details, and a multi-scale embedding module that captures high-level semantic information. Additionally, a guided fusion module integrates complementary features from both branches, ensuring a comprehensive representation of the audio data. Specifically, the multi-scale audio context embedding module is designed to effectively extract spatiotemporal dependencies, while the global refinement module aggregates multi-scale channel and temporal cues for enhanced modeling. The guided fusion module leverages these features to achieve efficient integration of complementary information, resulting in improved classification accuracy. Experimental results demonstrate the model's superior performance on multiple datasets, including ESC-50, UrbanSound8K, RAVDESS, and CREMA-D, with classification accuracies of 93.25%, 90.91%, 92.36%, and 70.50%, respectively. These results highlight the robustness and effectiveness of the proposed framework, which significantly outperforms existing approaches. By addressing critical challenges such as information loss and limited feature representation, this work provides new insights and methodologies for advancing audio classification and multimodal interaction systems.
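For intuition only, here is a much-simplified dual-branch stand-in (not the paper's architecture): one branch preserves fine temporal detail, another extracts multi-scale context with parallel kernels of different sizes, and a learned gate fuses the two before classification. All layer choices are assumptions.

```python
# Hedged sketch of a dual-branch raw-waveform classifier with gated fusion.
import torch
import torch.nn as nn

class DualBranchAudioNet(nn.Module):
    def __init__(self, n_classes=50):
        super().__init__()
        # Stand-in for the "global refinement" branch: small-stride 1-D convolutions.
        self.fine = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        # Stand-in for the "multi-scale embedding" branch: parallel kernels.
        self.coarse = nn.ModuleList([
            nn.Conv1d(1, 32, kernel_size=k, stride=16, padding=k // 2)
            for k in (31, 63, 127)
        ])
        self.gate = nn.Linear(64 + 96, 64 + 96)   # stand-in for guided fusion
        self.head = nn.Linear(64 + 96, n_classes)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        f = self.fine(wav).mean(dim=-1)            # (batch, 64)
        c = torch.cat([m(wav).mean(dim=-1) for m in self.coarse], dim=1)  # (batch, 96)
        z = torch.cat([f, c], dim=1)
        z = z * torch.sigmoid(self.gate(z))        # gated fusion of both branches
        return self.head(z)

logits = DualBranchAudioNet()(torch.randn(2, 1, 22050))   # two 1-second clips
print(logits.shape)                                        # torch.Size([2, 50])
```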
Funding: Funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant No. (G-1436-611-309).
Abstract: Cardiovascular diseases (CVDs) remain one of the foremost causes of death globally, hence the need for advanced, automated diagnostic solutions for early detection and intervention. Traditional auscultation of cardiovascular sounds is heavily reliant on clinical expertise and subject to high variability. To counter this limitation, this study proposes an AI-driven classification system for cardiovascular sounds in which deep learning techniques are used to automate the detection of abnormal heartbeats. We employ FastAI vision-learner-based convolutional neural networks (CNNs), including ResNet, DenseNet, VGG, ConvNeXt, SqueezeNet, and AlexNet, to classify heart sound recordings. Instead of raw waveform analysis, the proposed approach transforms preprocessed cardiovascular audio signals into spectrograms, which are well suited to capturing temporal and frequency-wise patterns. The models are trained on the PASCAL Cardiovascular Challenge dataset while taking into account recording variations, noise levels, and acoustic distortions. To demonstrate generalization, external validation was performed on Google's AudioSet Heartbeat Sound dataset, which is rich in cardiovascular sounds. Comparative analysis revealed that DenseNet-201, ConvNeXt Large, and ResNet-152 delivered superior performance to the other architectures, achieving an accuracy of 81.50%, a precision of 85.50%, and an F1-score of 84.50%. We also performed statistical significance testing, including the Wilcoxon signed-rank test, to validate performance improvements over traditional classification methods. Beyond the technical contributions, the research underscores clinical integration, outlining a pathway by which the proposed system can augment conventional electronic stethoscopes and telemedicine platforms in AI-assisted diagnostic workflows. We also discuss in detail issues of computational efficiency, model interpretability, and ethical considerations, particularly algorithmic bias stemming from imbalanced datasets and the need for real-time processing in clinical settings. The study describes a scalable, automated system combining deep learning, spectrogram-based feature extraction, and external validation that can assist healthcare providers in the early and accurate detection of cardiovascular disease. Such AI-driven solutions can be viable for improving access, reducing delays in diagnosis, and ultimately easing the continued global burden of heart disease.
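A minimal sketch of the spectrogram-image plus FastAI vision-learner workflow follows. The folder layout (`data/normal`, `data/murmur`), image size, and the lighter ResNet-34 backbone are assumptions for illustration; the study itself evaluates larger backbones such as DenseNet-201, ConvNeXt Large, and ResNet-152.

```python
# Hedged sketch: render heart-sound clips as Mel-spectrogram images, then train a
# FastAI vision learner on the resulting image folders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from torchvision.models import resnet34
from fastai.vision.all import ImageDataLoaders, Resize, accuracy, vision_learner

def save_spectrogram(wav_path, png_path, sr=4000):
    """Save a heart-sound recording as a Mel-spectrogram image."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    fig, ax = plt.subplots(figsize=(3, 3))
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    ax.axis("off")
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# Assumes spectrogram images already sorted into class folders, e.g. data/normal, data/murmur.
dls = ImageDataLoaders.from_folder("data", valid_pct=0.2, item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(5)
```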
Abstract: Pre-reading: What is a Deepfake, and how is it created? What are Deepfakes? A Deepfake is a video, image, or audio clip that has been created using artificial intelligence. The idea is to make it as realistic as possible.
Funding: Funded by the National Science and Technology Council of Taiwan under grant number NSTC 113-2221-E-035-058.
Abstract: With the rapid expansion of multimedia data, protecting digital information has become increasingly critical. Reversible data hiding offers an effective solution by allowing sensitive information to be embedded in multimedia files while enabling full recovery of the original data after extraction. Audio, as a vital medium in communication, entertainment, and information sharing, demands the same level of security as images. However, embedding data in encrypted audio poses unique challenges due to the trade-offs between security, data integrity, and embedding capacity. This paper presents a novel interpolation-based reversible data hiding algorithm for encrypted audio that achieves scalable embedding capacity. By increasing sample density through interpolation, embedding opportunities are significantly enhanced while encryption is maintained throughout the process. The method further integrates multiple most significant bit (multi-MSB) prediction and Huffman coding to optimize compression and embedding efficiency. Experimental results on standard audio datasets demonstrate the proposed algorithm's ability to embed up to 12.47 bits per sample, with over 9.26 bits per sample available as pure embedding capacity, while preserving full reversibility. These results confirm the method's suitability for secure applications that demand high embedding capacity and perfect reconstruction of the original audio. This work advances reversible data hiding in encrypted audio by offering a secure, efficient, and fully reversible framework.
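To convey the core idea of capacity through interpolation, here is a toy, non-encrypted illustration (explicitly not the paper's multi-MSB prediction and Huffman coding algorithm): every interpolated sample inserted between two originals can carry payload bits while the originals stay untouched, so the cover is perfectly recoverable by simply dropping the interpolated samples.

```python
# Toy illustration of interpolation-based reversible embedding.
import numpy as np

def interpolate_and_embed(samples, bits, n_bits_per_sample=4):
    """Insert a midpoint between consecutive samples and hide bits in its low bits."""
    out, k = [], 0
    for a, b in zip(samples[:-1], samples[1:]):
        out.append(int(a))
        mid = (int(a) + int(b)) // 2
        payload = 0
        for _ in range(n_bits_per_sample):
            payload = (payload << 1) | (bits[k % len(bits)] if bits else 0)
            k += 1
        out.append((mid & ~((1 << n_bits_per_sample) - 1)) | payload)
    out.append(int(samples[-1]))
    return np.array(out, dtype=np.int32)

def recover_original(stego):
    """Reversibility: the original samples are exactly the even-indexed entries."""
    return stego[::2]

audio = np.array([100, 104, 96, 120, 110], dtype=np.int16)
stego = interpolate_and_embed(audio, bits=[1, 0, 1, 1])
assert np.array_equal(recover_original(stego), audio)   # perfect reconstruction
```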
Funding: Supported by the National Natural Science Foundation of China (No. 62172352, 62171143, 42306218), the Guangdong Provincial Department of Education Ocean Ranch Equipment Information and Intelligent Innovation Team Project (No. 2023KCXTD016), the Natural Science Foundation of Hebei Province (No. 2023407003), and the Guangdong Ocean University Research Fund Project (No. 060302102304).
Abstract: To address the limited performance of single, homogeneous sensors and the high false alarm rate caused by environmental noise, this paper proposes HSDF, a fall detection framework based on heterogeneous sensor fusion of acceleration and audio signals. After analyzing the heterogeneity of the acceleration and audio data, the framework uses Dempster-Shafer theory (D-S theory) to integrate the outputs of the acceleration and audio branches at the decision level. First, a normalized window-interception algorithm, the anomaly location window (ALW) algorithm, is proposed by analyzing the fall process and the characteristic acceleration changes in the acceleration data. Second, a one-dimensional residual convolutional network (1D-ReCNN) is designed for fall detection on the audio data. Finally, experiments on volunteers' simulated fall data and free-living data collected in real environments verify that the HSDF framework offers clear advantages in sensitivity and false alarm rate.
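The decision-level fusion step attributed to D-S theory can be illustrated with Dempster's rule of combination. The mass values below are made-up outputs of a hypothetical acceleration branch (ALW-based) and audio branch (1D-ReCNN), not measured results.

```python
# Minimal sketch of Dempster's rule of combination for two evidence sources.
from itertools import product

def combine(m1, m2):
    """Fuse two mass functions (dicts: frozenset of hypotheses -> mass)."""
    conflict = 0.0
    fused = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb              # mass assigned to conflicting evidence
    return {k: v / (1.0 - conflict) for k, v in fused.items()}

FALL, NOFALL = frozenset({"fall"}), frozenset({"no_fall"})
EITHER = FALL | NOFALL                        # uncertainty mass (either hypothesis)

m_accel = {FALL: 0.70, NOFALL: 0.10, EITHER: 0.20}   # acceleration-branch beliefs
m_audio = {FALL: 0.60, NOFALL: 0.25, EITHER: 0.15}   # audio-branch beliefs
print(combine(m_accel, m_audio))              # fused belief strongly favors "fall"
```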