Funding: Supported by the German National BMBF IKT2020 Grant (16SV7213) (EmotAsS), the European Union's Horizon 2020 Research and Innovation Programme (688835) (DE-ENIGMA), and the China Scholarship Council (CSC).
Abstract: Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet, the spectrogram alone leaves a substantial amount of time-frequency information unexploited. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream. The approach presented firstly transforms the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms; secondly, the spectrograms or scalograms are fed into pre-trained convolutional neural networks; thirdly, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, which are followed by a single highway layer and a softmax layer; finally, the predictions from these three systems are fused by a margin sampling value strategy. We then evaluate the proposed approach using the acoustic scene classification dataset of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% is obtained with bidirectional gated recurrent neural networks when fusing the spectrogram and the bump scalogram, an improvement over the 61.0% baseline result provided by the DCASE 2017 organisers. This result shows that the extracted bump scalograms are capable of improving classification accuracy when fused with a spectrogram-based system.
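To make the fusion step concrete, below is a minimal sketch of one plausible reading of the margin sampling value strategy: for each test sample, every system's top-two softmax probabilities define a confidence margin, and the prediction of the most confident system is kept. The function name, the per-sample selection rule, and the toy 15-class data are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def margin_sampling_fusion(prob_sets):
    """Fuse per-system class posteriors by margin sampling value.

    prob_sets: list of (n_samples, n_classes) softmax outputs, one
    array per system. For each sample, the prediction of the system
    with the largest top-1/top-2 probability margin is kept.
    """
    probs = np.stack(prob_sets)                   # (n_systems, n_samples, n_classes)
    ranked = np.sort(probs, axis=-1)              # ascending along the class axis
    margins = ranked[..., -1] - ranked[..., -2]   # top-1 minus top-2 probability
    best_system = np.argmax(margins, axis=0)      # most confident system per sample
    preds = np.argmax(probs, axis=-1)             # per-system class predictions
    return preds[best_system, np.arange(preds.shape[1])]

# Toy usage: three systems, four samples, the 15 DCASE 2017 classes.
rng = np.random.default_rng(0)
fake_posteriors = [rng.dirichlet(np.ones(15), size=4) for _ in range(3)]
print(margin_sampling_fusion(fake_posteriors))
```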
Funding: Supported by the National Natural Science Foundation of China (61102127, 61231015), the National High Technology Research and Development Program of China (863 Program, 2015AA016306), the National Key Research and Development Program (2016YFB0502204), the Innovation Fund of Shanghai Aerospace Science and Technology (SAST, 2015014), the Key Technology R&D Program of Hubei Province (2014BAA153), and SKLSE-2015-A-06.
Abstract: Recently, deep neural networks, including convolutional neural networks (CNNs), have been widely applied to acoustic scene classification (ASC). Motivated by the fact that some simplified CNNs have shown improvements over deep CNNs such as the Visual Geometry Group Net (VGG-Net), we simplify the VGG-Net style architecture into a shallow CNN with improved performance. Max pooling and batch normalization are also applied for better accuracy. In a series of controlled tests on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 datasets, our shallow CNN achieves a 6.7% improvement in accuracy and reduces time complexity to 5% of that of the VGG-Net style CNN.
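For orientation, a minimal PyTorch sketch of a shallow VGG-style CNN with batch normalization and max pooling follows. The two-block depth, channel widths, and 15-class output are illustrative assumptions; the paper's exact layer configuration may differ.

```python
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    """Shallow VGG-style CNN for ASC: two conv blocks, each with
    batch normalization and max pooling, then a small classifier."""

    def __init__(self, n_classes: int = 15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # collapse the time-frequency map
            nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames), e.g. a log-mel spectrogram
        return self.classifier(self.features(x))

logits = ShallowCNN()(torch.randn(8, 1, 64, 128))  # 8 clips, 64 bands, 128 frames
print(logits.shape)                                # torch.Size([8, 15])
```

Keeping the network this shallow is what yields the reported drop in time complexity: fewer convolutional layers mean far fewer multiply-accumulate operations per spectrogram.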
Funding: This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-0268, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)].
Abstract: Acoustic scene classification (ASC) is a method of recognizing and classifying environments from acoustic signals. Various ASC approaches based on deep learning have been developed, with convolutional neural networks (CNNs) proving to be the most reliable and most commonly utilized in ASC systems due to their suitability for constructing lightweight models. When using ASC systems in the real world, model complexity and device robustness are essential considerations. In this paper, we propose a two-pass mobile network for low-complexity classification of acoustic scenes, named TP-MobNet. TP-MobNet is based on MobileNetV2, with inverted residuals and linear bottlenecks, and applies coordinate attention and two-pass fusion after the mobile blocks. Coordinate attention captures long-range dependencies and precise positional information in feature maps. Two-pass fusion improves generalization by capturing more diverse feature resolutions at the ends of the network. The model size is further reduced by applying weight quantization to the trained model. All experiments were conducted on the TAU Urban Acoustic Scenes 2020 Mobile development set. The proposed model achieves an accuracy of 73.94% with a model size of 219.6 kB.
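As an illustration of the attention mechanism named above, the following is a minimal PyTorch sketch of a coordinate attention block in the style of Hou et al. (2021): the feature map is pooled separately along height and width, encoded jointly, then split into two direction-aware gates. The reduction ratio and the ReLU in place of a hard-swish activation are simplifying assumptions, not TP-MobNet's exact settings.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: factorizes channel attention into two
    1D encodings along height and width, so the gates carry precise
    positional information as well as long-range dependencies."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                        # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)    # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (n, c, 1, w)
        return x * a_h * a_w                        # direction-aware reweighting

feats = torch.randn(2, 64, 16, 16)
print(CoordinateAttention(64)(feats).shape)   # torch.Size([2, 64, 16, 16])
```

Quantizing the trained weights, e.g. from 32-bit floats to 8-bit integers, would then shrink the stored model roughly fourfold, which is consistent with the small reported model size.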