Deep learning significantly improves the accuracy of remote sensing image scene classification, benefiting from large-scale datasets. However, annotating remote sensing images is time-consuming and difficult even for experts, and deep neural networks trained on only a few labeled samples generalize poorly to new, unseen images. In this paper, we propose a semi-supervised approach for remote sensing image scene classification based on prototype-based consistency, which exploits massive unlabeled images. To this end, we first propose a feature enhancement module that extracts discriminative features by focusing the model on foreground areas. Then, a prototype-based classifier is introduced into the framework to acquire consistent feature representations. We conduct a series of experiments on NWPU-RESISC45 and the Aerial Image Dataset (AID). Our method improves on the state-of-the-art (SOTA) accuracy from 92.03% to 93.08% on NWPU-RESISC45 and from 94.25% to 95.24% on AID.
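The abstract does not give implementation details; the following minimal Python sketch shows one standard form of a prototype-based classifier, assuming prototypes are L2-normalized class-mean embeddings and logits are temperature-scaled cosine similarities (the temperature value and feature dimensions are illustrative):

    import torch
    import torch.nn.functional as F

    def class_prototypes(features, labels, num_classes):
        # One common prototype definition: L2-normalized mean embedding per class.
        protos = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
        return F.normalize(protos, dim=1)

    def prototype_logits(features, prototypes, temperature=0.1):
        # Cosine similarity between each embedding and each prototype, temperature-scaled.
        return F.normalize(features, dim=1) @ prototypes.t() / temperature

    # Toy usage: 64-dim embeddings for 9 labeled samples over 3 classes.
    feats = torch.randn(9, 64)
    labels = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2])
    protos = class_prototypes(feats, labels, num_classes=3)
    unlabeled = torch.randn(5, 64)
    pseudo = prototype_logits(unlabeled, protos).softmax(dim=1)  # soft targets for a consistency loss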
Recently, deep neural networks, including convolutional neural networks (CNNs), have been widely applied to acoustic scene classification (ASC). Motivated by the fact that some simplified CNNs have shown improvements over deep CNNs such as the Visual Geometry Group network (VGG-Net), we show how to simplify the VGG-Net-style architecture into a shallow CNN with improved performance. Max pooling and batch normalization are also applied for better accuracy. In a series of controlled tests on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 data sets, our shallow CNN achieves a 6.7% improvement over the VGG-Net-style CNN while reducing time complexity to 5% of it.
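The exact layer configuration is not reproduced here; the sketch below illustrates the general recipe of a shallow VGG-style CNN with batch normalization and max pooling for spectrogram input (all widths, depths, and the 40 x 500 log-mel input shape are assumptions):

    import torch
    import torch.nn as nn

    class ShallowASCNet(nn.Module):
        # Illustrative shallow VGG-style stack; not the paper's exact configuration.
        def __init__(self, num_classes=15):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)
            )

        def forward(self, x):  # x: (batch, 1, mel_bins, frames)
            return self.classifier(self.features(x))

    print(ShallowASCNet()(torch.randn(2, 1, 40, 500)).shape)  # torch.Size([2, 15])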
Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet the spectrogram alone does not capture a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations extracted in segments from an audio stream. The approach first transforms the segmented acoustic scenes into bump and Morse scalograms as well as spectrograms; second, the spectrograms or scalograms are fed into pre-trained convolutional neural networks; third, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, followed by a single highway layer and a softmax layer; finally, the predictions of these three systems are fused by a margin-sampling-value strategy. We evaluate the proposed approach on the acoustic scene classification data set of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% from bidirectional gated recurrent neural networks is obtained when fusing the spectrogram and the bump scalogram, an improvement over the 61.0% baseline provided by the DCASE 2017 organisers. This result shows that the extracted bump scalograms can improve classification accuracy when fused with a spectrogram-based system.
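The fusion step can be read as picking, per sample, the system whose posterior has the largest margin between its top two classes; the sketch below implements that reading (it is an assumption about the margin-sampling-value strategy, not the authors' verified rule):

    import numpy as np

    def margin_fusion(prob_sets):
        # For each sample, trust the system with the largest top-1 minus top-2 margin.
        fused = []
        for probs in zip(*prob_sets):  # per-sample tuple of probability vectors
            margins = [np.sort(p)[-1] - np.sort(p)[-2] for p in probs]
            fused.append(int(np.argmax(probs[int(np.argmax(margins))])))
        return np.array(fused)

    # Toy usage: two systems, three samples, four classes.
    a = np.array([[0.6, 0.2, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2], [0.25, 0.25, 0.25, 0.25]])
    b = np.array([[0.4, 0.3, 0.2, 0.1], [0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]])
    print(margin_fusion([a, b]))  # [0 0 1]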
The latest advancements in the integration of camera sensors pave the way for new Unmanned Aerial Vehicle (UAV) applications, such as analyzing geographical (spatial) variations of earth science to mitigate harmful environmental impacts and climate change. UAVs have attracted significant attention as a remote sensing environment, capturing high-resolution images of different scenes such as land, forest fires, flooding threats, road collisions, and landslides to enhance data analysis and decision making. Dynamic scene classification has attracted much attention in the examination of earth data captured by UAVs. This paper proposes a new multi-modal fusion based earth data classification (MMF-EDC) model. The MMF-EDC technique aims to identify the patterns that exist in the earth data and classify them into appropriate class labels. The technique fuses histogram of oriented gradients (HOG), local binary pattern (LBP), and residual network (ResNet) models; this fusion integrates several feature vectors, and an entropy-based fusion process is carried out to enhance classification performance. In addition, the quantum artificial flora optimization (QAFO) algorithm is applied as a hyperparameter optimization technique. The AFO algorithm, inspired by the reproduction and migration of flora, helps determine the optimal parameters of the ResNet model, namely the learning rate, the number of hidden layers, and the number of neurons per layer. Besides, a Variational Autoencoder (VAE) based classification model is applied to assign appropriate class labels to a useful set of feature vectors. The proposed MMF-EDC model has been tested using the UCM and WHU-RS datasets, where it exhibits promising classification results with accuracies of 0.989 and 0.994 on the UCM and WHU-RS test sets, respectively.
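As a small concrete illustration of the handcrafted half of the fusion, the sketch below computes HOG and LBP descriptors with scikit-image and concatenates them; the ResNet deep features, entropy weighting, QAFO tuning, and VAE classifier are omitted, and all parameter values are assumptions:

    import numpy as np
    from skimage.feature import hog, local_binary_pattern

    def handcrafted_features(gray):
        # HOG descriptor plus a uniform-LBP histogram, concatenated.
        h = hog(gray, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
        lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
        lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        return np.concatenate([h, lbp_hist])

    feats = handcrafted_features(np.random.rand(128, 128))  # toy grayscale image
    print(feats.shape)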
With the rapid development of computer technology, millions of images are produced every day by different sources. Efficiently processing these images and accurately discerning the scenes in them is an important but tough task. In this paper, we propose a novel supervised learning framework based on adaptive binary coding for scene classification. Specifically, we first extract high-level features of the images under consideration using available models trained on public datasets. Then, we design a binary encoding method, called one-hot encoding, to make the feature representation more efficient. Benefiting from the proposed adaptive binary coding, our method requires no time to train or fine-tune the deep network and can effectively handle different applications. Experimental results on three public datasets, i.e., the UIUC sports event dataset, the MIT Indoor dataset, and the UC Merced dataset, with three different classifiers demonstrate that our method is superior to the state-of-the-art methods by large margins.
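The abstract leaves the coding rule unspecified; one plausible reading, sketched below under that assumption, binarizes each feature dimension against a data-dependent (adaptive) threshold:

    import numpy as np

    def adaptive_binary_code(features):
        # Binarize each dimension against its batch mean -- an adaptive threshold.
        return (features > features.mean(axis=0, keepdims=True)).astype(np.float32)

    codes = adaptive_binary_code(np.random.rand(4, 8))  # 4 images, 8-dim high-level features
    print(codes)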
The process of human natural scene categorization consists of two correlated stages: visual perception and visual cognition of natural scenes. Inspired by this fact, we propose a biologically plausible approach for natural scene image classification. This approach consists of one visual perception model and two visual cognition models. The visual perception model, composed of two steps, is used to extract discriminative features from natural scene images. In the first step, we mimic the oriented and bandpass properties of the human primary visual cortex with a special complex wavelet transform, which can decompose a natural scene image into a series of 2D spatial structure signals. In the second step, a hybrid statistical feature extraction method is used to generate gist features from those 2D spatial structure signals. We then design a cognitive feedback model to adaptively optimize the visual perception model. Finally, we build a multiple-semantics-based cognition model to imitate the human cognitive mode in rapid natural scene categorization. Experiments on natural scene datasets show that the proposed method achieves high efficiency and accuracy for natural scene classification.
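As a rough stand-in for the complex-wavelet gist pipeline, the sketch below decomposes an image with a real 2D wavelet transform (PyWavelets) and summarizes each subband with simple statistics; the paper's actual method uses a complex wavelet transform and a hybrid statistical extractor, so this is only an assumption-laden illustration:

    import numpy as np
    import pywt

    def gist_like_features(gray, levels=3):
        # Mean absolute value and standard deviation of every wavelet subband.
        coeffs = pywt.wavedec2(gray, wavelet="db2", level=levels)
        stats = [np.mean(np.abs(coeffs[0])), np.std(coeffs[0])]
        for detail in coeffs[1:]:  # (cH, cV, cD) per level
            for band in detail:
                stats += [np.mean(np.abs(band)), np.std(band)]
        return np.array(stats)

    print(gist_like_features(np.random.rand(128, 128)).shape)  # (2 + levels*3*2,)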
Remote sensing scene image classification is a prominent research area within remote sensing. Deep learning-based methods have been extensively utilized and have shown significant advancements in this field. Recent progress in these methods primarily focuses on enhancing feature representation capabilities to improve performance. The challenge lies in the limited spatial resolution of small-sized remote sensing images, as well as image blurring and sparse data; these factors lower the accuracy of current deep learning models. Additionally, deeper networks with attention-based modules require a substantial number of parameters, leading to high computational costs and memory usage. In this article, we introduce ERSNet, a lightweight attention-guided network for remote sensing scene image classification. ERSNet is constructed from depthwise separable convolutions and incorporates spatial attention, channel attention, and channel self-attention to enhance feature representation and accuracy while reducing computational complexity and memory usage. Experimental results indicate that, compared to existing state-of-the-art methods, ERSNet has a significantly lower parameter count of only 1.2M and reduced FLOPs. It achieves the highest classification accuracy of 99.14% on the EuroSAT dataset, demonstrating its suitability for mobile terminal devices. Furthermore, experimental results on the UC Merced land use dataset and the Brazilian coffee scene dataset confirm the strong generalization ability of this method.
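The building blocks ERSNet relies on can be illustrated generically: a depthwise separable convolution followed by channel attention. The sketch below uses a squeeze-and-excitation style gate as the channel attention and is not ERSNet's actual block; all sizes are assumptions:

    import torch
    import torch.nn as nn

    class DSConvSE(nn.Module):
        # Depthwise separable convolution + squeeze-and-excitation channel attention.
        def __init__(self, c_in, c_out, reduction=4):
            super().__init__()
            self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)  # depthwise
            self.pw = nn.Conv2d(c_in, c_out, 1)                         # pointwise
            self.bn = nn.BatchNorm2d(c_out)
            self.att = nn.Sequential(                                   # channel gate
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(c_out, c_out // reduction, 1), nn.ReLU(),
                nn.Conv2d(c_out // reduction, c_out, 1), nn.Sigmoid(),
            )

        def forward(self, x):
            y = torch.relu(self.bn(self.pw(self.dw(x))))
            return y * self.att(y)

    print(DSConvSE(16, 32)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])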
Remote sensing image scene classification and remote sensing technology applications are hot research topics. Although CNN-based models have reached high average accuracy, some classes are still misclassified, such as “freeway,” “sparse residential,” and “commercial_area.” These classes contain typical decisive features, spatial-relation features, or mixed decisive and spatial-relation features, which limit high-quality image scene classification. To address this issue, this paper proposes a hybrid Grad-CAM and capsule network method for image scene classification. The Grad-CAM and capsule network structures can recognize decisive features and spatial-relation features, respectively. By using a pre-trained model, a hybrid structure, and structure adjustment, the proposed model can recognize both decisive and spatial-relation features. A group of experiments is designed on three popular data sets of increasing classification difficulty. On the most difficult data set, 92.67% average accuracy is achieved; specifically, 83%, 75%, and 86% accuracies are obtained for the “church,” “palace,” and “commercial_area” classes, respectively. This research demonstrates that the hybrid structure effectively improves performance by considering both decisive and spatial-relation features. Grad-CAM-CapsNet is therefore a promising and powerful structure for image scene classification.
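Grad-CAM itself is standard and compact; a minimal PyTorch sketch is below (the backbone, target layer, and input size are placeholders, and the capsule-network half of the hybrid is not shown):

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet18

    def grad_cam(model, layer, x, cls):
        # Channel weights = global-average-pooled gradients of the class score
        # w.r.t. the layer's activations; CAM = ReLU of the weighted sum.
        acts, grads = [], []
        h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
        h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
        model(x)[0, cls].backward()
        h1.remove(); h2.remove()
        w = grads[0].mean(dim=(2, 3), keepdim=True)
        return F.relu((w * acts[0]).sum(dim=1))  # (N, H, W) heat map

    model = resnet18(weights=None).eval()
    cam = grad_cam(model, model.layer4, torch.randn(1, 3, 224, 224), cls=0)
    print(cam.shape)  # torch.Size([1, 7, 7])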
Remote sensing plays a pivotal role in environmental monitoring, disaster relief, and urban planning, where accurate scene classification of aerial images is essential. However, conventional convolutional neural networks (CNNs) struggle with long-range dependencies and preserving high-resolution features, limiting their effectiveness in complex aerial image analysis. To address these challenges, we propose a hybrid HRNet-Swin Transformer model that synergizes the strengths of HRNet-W48 for high-resolution segmentation and the Swin Transformer for global feature extraction. This hybrid architecture ensures robust multi-scale feature fusion, capturing fine-grained details and broader contextual relationships in aerial imagery. Our methodology begins with preprocessing steps, including normalization, histogram equalization, and noise reduction, to enhance input data quality. The HRNet-W48 backbone maintains high-resolution feature maps throughout the network, enabling precise segmentation, while the Swin Transformer leverages hierarchical self-attention to model long-range dependencies efficiently. By integrating these components, our model achieves superior performance in segmentation and classification tasks compared to traditional CNNs and standalone transformer models. We evaluate our approach on two benchmark datasets: UC Merced and WHU-RS19. Experimental results demonstrate that the proposed hybrid model outperforms existing methods, achieving state-of-the-art accuracy while maintaining computational efficiency. In particular, it excels at preserving fine spatial details and contextual understanding, which is critical for applications like land-use classification and disaster assessment.
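The preprocessing stage is concrete enough to sketch; below is one plausible OpenCV realization of the normalization, histogram equalization, and noise reduction steps named in the abstract (CLAHE on the luminance channel and non-local-means denoising are assumptions, as are all parameter values):

    import cv2
    import numpy as np

    def preprocess(img_bgr):
        # Contrast-limited histogram equalization on L, light denoising, then
        # zero-mean/unit-variance normalization.
        lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        lab[..., 0] = clahe.apply(lab[..., 0])
        img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
        img = cv2.fastNlMeansDenoisingColored(img, None, 3, 3, 7, 21)
        img = img.astype(np.float32)
        return (img - img.mean()) / (img.std() + 1e-8)

    out = preprocess(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))
    print(out.shape, round(float(out.mean()), 3))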
Recognizing road scene context from a single image remains a critical challenge for intelligent autonomous driving systems, particularly in dynamic and unstructured environments. While recent advances in deep learning have significantly improved road scene classification, simultaneously achieving high accuracy, computational efficiency, and adaptability across diverse conditions remains difficult. To address these challenges, this study proposes HybridLSTM, a novel and efficient framework that integrates deep learning-based, object-based, and handcrafted feature extraction within a unified architecture. HybridLSTM classifies four distinct road scene categories, namely crosswalk (CW), highway (HW), overpass/tunnel (OP/T), and parking (P), and leverages multiple publicly available datasets, including Places-365, BDD100K, LabelMe, and KITTI, thereby promoting domain generalization. The framework fuses object-level features extracted using YOLOv5 and VGG19, scene-level global representations obtained from a modified VGG19, and fine-grained texture features captured through eight handcrafted descriptors. This hybrid feature fusion enables the model to capture both semantic context and low-level visual cues, which are critical for robust scene understanding. To model spatial arrangements and latent sequential dependencies present even in static imagery, the combined features are processed through a Long Short-Term Memory (LSTM) network, allowing discriminative patterns to be extracted across heterogeneous feature spaces. Extensive experiments on 2725 annotated road scene images, with an 80:20 training-to-testing split, validate the effectiveness of the proposed model. HybridLSTM achieves a classification accuracy of 96.3%, a precision of 95.8%, a recall of 96.1%, and an F1-score of 96.0%, outperforming several existing state-of-the-art methods. These results demonstrate the robustness, scalability, and generalization capability of HybridLSTM across varying environments and scene complexities. Moreover, the framework is optimized to balance classification performance with computational efficiency, making it highly suitable for real-time deployment in embedded autonomous driving systems. Future work will focus on extending the model to multi-class detection within a single frame and on further optimization for edge-device deployment to reduce computational overhead in practical applications.
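A minimal sketch of the fusion-into-LSTM idea follows: each feature source (object-level, scene-level, texture) is projected to a common width and treated as one step of a short sequence. All dimensions are illustrative, and the YOLOv5/VGG19 extractors themselves are not shown:

    import torch
    import torch.nn as nn

    class FusionLSTM(nn.Module):
        # Heterogeneous feature vectors as a 3-step sequence into an LSTM head.
        def __init__(self, obj_dim=512, scene_dim=512, tex_dim=64, hidden=128, classes=4):
            super().__init__()
            self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in (obj_dim, scene_dim, tex_dim)])
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, classes)

        def forward(self, obj, scene, tex):
            seq = torch.stack([p(f) for p, f in zip(self.proj, (obj, scene, tex))], dim=1)
            _, (h, _) = self.lstm(seq)  # each feature source is one "time step"
            return self.head(h[-1])

    logits = FusionLSTM()(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 64))
    print(logits.shape)  # torch.Size([2, 4]) -- CW / HW / OP-T / P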
Scene classification plays an essential role in processing very high resolution (VHR) images for understanding. Scene classification in remote sensing faces two difficulties: mismatched features caused by model overfitting, and the loss of semantic information. Multi-task methods help solve these problems through the shared weights of multiple tasks. We propose a feature boosting method with a multi-task framework that combines the scene classification task and the semantic segmentation task to overcome these difficulties. Different from traditional multi-task learning, the two tasks are coupled together via weakly supervised learning, so labelled semantic segmentation samples are not required. First, we propose a weakly supervised segmentation method to interconnect the segmentation and classification tasks, obtaining a coarse segmentation result that is highly correlated with the classification. Second, based on the surface distribution of remote sensing scenes, we propose a sparse surface constraint to obtain fine segmentation results; fine features are obtained by constraining the shared weights of the weakly supervised segmentation method. Finally, we classify the scenes using the fine features and conduct experiments on public remote sensing scene classification datasets. Experimental results demonstrate that the proposed coupled multi-task model outperforms the state-of-the-art methods on remote sensing scene classification.
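The coupling can be illustrated with a toy joint loss: the scene label supervises the classifier while an unsupervised sparsity penalty on the predicted segmentation map stands in for the paper's sparse surface constraint. This is a loose sketch under those assumptions, not the authors' formulation:

    import torch
    import torch.nn.functional as F

    def multitask_loss(cls_logits, labels, seg_logits, lam=0.1):
        # Supervised classification term + sparsity term on the (weakly
        # supervised) foreground map; no pixel labels are needed.
        cls_loss = F.cross_entropy(cls_logits, labels)
        sparsity = seg_logits.sigmoid().mean()
        return cls_loss + lam * sparsity

    loss = multitask_loss(torch.randn(4, 45), torch.randint(0, 45, (4,)),
                          torch.randn(4, 1, 64, 64))
    print(float(loss))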
Acoustic scene classification (ASC) is the task of recognizing and classifying environments from acoustic signals. Various deep learning based ASC approaches have been developed, with convolutional neural networks (CNNs) proving the most reliable and most commonly used in ASC systems, owing to their suitability for constructing lightweight models. When deploying ASC systems in the real world, model complexity and device robustness are essential considerations. In this paper, we propose a two-pass mobile network for low-complexity acoustic scene classification, named TP-MobNet. TP-MobNet is based on MobileNetV2 with inverted residuals and linear bottlenecks; following the mobile blocks, coordinate attention and two-pass fusion approaches are utilized. Coordinate attention allows long-range dependencies and precise position information to be learned in the feature maps, while two-pass fusion improves generalization by capturing more diverse feature resolutions at the network's end sides. The model size is further reduced by applying weight quantization to the trained model. All experiments used the TAU Urban Acoustic Scenes 2020 Mobile development set. The proposed model, with a model size of 219.6 kB, achieves an accuracy of 73.94%.
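Coordinate attention (Hou et al., 2021) is the named building block; a compact PyTorch sketch of it follows (channel counts and the reduction ratio are assumptions, and the surrounding MobileNetV2 blocks are omitted):

    import torch
    import torch.nn as nn

    class CoordAttention(nn.Module):
        # Pool along height and width separately so the attention keeps position.
        def __init__(self, c, reduction=8):
            super().__init__()
            m = max(8, c // reduction)
            self.conv1 = nn.Sequential(nn.Conv2d(c, m, 1), nn.BatchNorm2d(m), nn.ReLU())
            self.conv_h = nn.Conv2d(m, c, 1)
            self.conv_w = nn.Conv2d(m, c, 1)

        def forward(self, x):
            n, c, h, w = x.shape
            xh = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
            xw = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
            y = self.conv1(torch.cat([xh, xw], dim=2))            # shared transform
            yh, yw = torch.split(y, [h, w], dim=2)
            ah = self.conv_h(yh).sigmoid()                        # attention over rows
            aw = self.conv_w(yw.permute(0, 1, 3, 2)).sigmoid()    # attention over columns
            return x * ah * aw

    print(CoordAttention(32)(torch.randn(2, 32, 16, 16)).shape)  # torch.Size([2, 32, 16, 16])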
Over the past decade, the significant growth of convolutional neural network (CNN) based deep learning (DL) approaches has greatly improved machine learning (ML) performance on the semantic scene classification (SSC) of remote sensing images (RSI). However, the unbalanced attention to classification accuracy versus efficiency has caused DL-based algorithms to partially lose their advantages, e.g., automation and simplicity. Traditional ML strategies (e.g., handcrafted features or indicators) and accuracy-aimed strategies with a high trade-off (e.g., multi-stage CNNs and ensembles of multiple CNNs) are widely used without any training efficiency optimization, which may result in suboptimal performance. To address this problem, we propose a fast and simple training CNN framework (named FST-EfficientNet) for RSI-SSC based on an EfficientNet version 2 small (EfficientNetV2-S) CNN model. The whole algorithm flow is completely one-stage and end-to-end, without any handcrafted features or discriminators. For training efficiency optimization, only several routine data augmentation tricks coupled with a fixed ratio of resolution or a gradually increasing resolution strategy are employed, so the training overhead is very low. The performance evaluation shows that our FST-EfficientNet achieves new state-of-the-art (SOTA) records in overall accuracy (OA), about 0.8% to 2.7% ahead of all earlier methods, on the Aerial Image Dataset (AID) and the Northwestern Polytechnical University Remote Sensing Image Scene Classification 45 Dataset (NWPU-RESISC45D). The results also demonstrate the importance and indispensability of training efficiency optimization for RSI-SSC by DL: it is not necessary to pursue better classification accuracy by relying entirely on an excessive trade-off that ignores efficiency. Ultimately, these findings are expected to contribute to the development of more efficient CNN-based approaches in RSI-SSC.
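The gradually increasing resolution strategy is easy to sketch: train early epochs on small crops and later epochs on larger ones, rebuilding the data pipeline at each boundary. The schedule below is a placeholder, not the paper's recipe:

    from torchvision import transforms

    schedule = [(0, 128), (10, 192), (20, 256)]  # (start_epoch, crop resolution)

    def resolution_for(epoch):
        return max(size for start, size in schedule if epoch >= start)

    def make_transform(size):
        # Routine augmentation tricks, as the abstract describes.
        return transforms.Compose([
            transforms.RandomResizedCrop(size),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])

    for epoch in (0, 12, 25):  # rebuild the DataLoader whenever the size changes
        print(epoch, resolution_for(epoch))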
Purpose – The purpose of this paper is to build a classification system that mimics the perceptual ability of human vision in gathering knowledge about the structure, content and surrounding environment of a real-world natural scene accurately at a quick glance. This paper proposes a set of novel features to determine the gist of a given scene based on dominant color, dominant direction, openness and roughness. Design/methodology/approach – The classification system is designed at two levels. At the first level, a set of low-level features is extracted for each semantic feature. At the second level, the extracted features are subjected to feature evaluation based on inter-class and intra-class distances. The most discriminating features are retained and used to train a support vector machine (SVM) classifier on two different data sets. Findings – The accuracy of the proposed system has been evaluated on two data sets: the well-known Oliva-Torralba data set and a customized data set comprising high-resolution images of natural landscapes. Experiments on these two data sets with the proposed novel feature set and the SVM classifier yield 92.68 percent average classification accuracy using a ten-fold cross-validation approach. The proposed features efficiently represent visual information and are therefore capable of narrowing the semantic gap between low-level image representation and high-level human perception. Originality/value – The method presented in this paper represents a new approach to extracting low-level features of reduced dimensionality that is able to model human perception for the task of scene classification. The methods of mapping primitive features to high-level features are intuitive to the user and are capable of reducing the semantic gap. The proposed feature evaluation technique is general and can be applied across any domain.
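The inter-class/intra-class distance criterion maps naturally onto a Fisher-score style ranking; a hedged scikit-learn sketch follows (the score definition, the number of retained features, and the RBF kernel are assumptions):

    import numpy as np
    from sklearn.svm import SVC

    def fisher_scores(X, y):
        # Between-class variance over within-class variance, per feature.
        classes = np.unique(y)
        mu = X.mean(axis=0)
        between = sum((X[y == c].mean(axis=0) - mu) ** 2 * (y == c).sum() for c in classes)
        within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes)
        return between / (within + 1e-12)

    X, y = np.random.rand(60, 20), np.repeat([0, 1, 2], 20)
    keep = np.argsort(fisher_scores(X, y))[-8:]  # retain the most discriminating features
    clf = SVC(kernel="rbf").fit(X[:, keep], y)
    print(clf.score(X[:, keep], y))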
Aiming at the convergence between Earth observation (EO) Big Data and Artificial General Intelligence (AGI), this two-part paper identifies an innovative, but realistic, EO optical sensory image-derived semantics-enriched Analysis Ready Data (ARD) product-pair and process gold standard as the linchpin for success of a new notion of Space Economy 4.0. To be implemented in operational mode at the space segment and/or midstream segment by both public and private EO big data providers, it is regarded as a necessary-but-not-sufficient “horizontal” (enabling) precondition for: (I) transforming existing EO big raster-based data cubes at the midstream segment, typically affected by the so-called data-rich information-poor syndrome, into a new generation of semantics-enabled EO big raster-based numerical data and vector-based categorical (symbolic, semi-symbolic or subsymbolic) information cube management systems, eligible for semantic content-based image retrieval and semantics-enabled information/knowledge discovery; (II) boosting the downstream segment in the development of an ever-increasing ensemble of “vertical” (deep and narrow, user-specific and domain-dependent) value-adding information products and services, suitable for a potentially huge worldwide market of institutional and private end-users of space technology. For the sake of readability, this paper consists of two parts. In the present Part 1, background notions in the remote sensing metascience domain are first critically revised for harmonization across the multidisciplinary domain of cognitive science. In short, the keyword “information” is disambiguated into the two complementary notions of quantitative/unequivocal information-as-thing and qualitative/equivocal/inherently ill-posed information-as-data-interpretation. Moreover, the buzzword “artificial intelligence” is disambiguated into the two better-constrained notions of Artificial Narrow Intelligence and AGI, where the former is part-without-inheritance of the latter. Second, based on a better-defined and better-understood vocabulary of multidisciplinary terms, existing EO optical sensory image-derived Level 2/ARD products and processes are investigated at Marr's five levels of understanding of an information processing system. To overcome their drawbacks, an innovative, but realistic, EO optical sensory image-derived semantics-enriched ARD product-pair and process gold standard is proposed in the subsequent Part 2.
Aiming at the convergence between Earth observation (EO) Big Data and Artificial General Intelligence (AGI), this paper consists of two parts. In the previous Part 1, existing EO optical sensory image-derived Level 2/Analysis Ready Data (ARD) products and processes were critically compared, to overcome their lack of harmonization/standardization/interoperability and their unsuitability for a new notion of Space Economy 4.0. In the present Part 2, the original contributions comprise, at Marr's five levels of system understanding: (1) an innovative, but realistic, EO optical sensory image-derived semantics-enriched ARD co-product pair requirements specification. First, in pursuit of third-level semantic/ontological interoperability, a novel ARD symbolic (categorical and semantic) co-product, known as the Scene Classification Map (SCM), adopts an augmented Cloud versus Not-Cloud taxonomy, whose Not-Cloud class legend complies with the standard fully-nested Land Cover Classification System's Dichotomous Phase taxonomy proposed by the United Nations Food and Agriculture Organization. Second, a novel ARD subsymbolic numerical co-product: specifically, a panchromatic or multispectral EO image whose dimensionless digital numbers are radiometrically calibrated into a physical unit of radiometric measure, ranging from top-of-atmosphere reflectance to surface reflectance and surface albedo values, in a five-stage radiometric correction sequence. (2) An original ARD process requirements specification. (3) An innovative ARD processing system design (architecture), where stepwise SCM generation and stepwise SCM-conditional EO optical image radiometric correction alternate in sequence. (4) An original modular hierarchical hybrid (combined deductive and inductive) computer vision subsystem design, provided with feedback loops, where software solutions at Marr's two shallowest levels of system understanding, specifically algorithm and implementation, are selected from the scientific literature to benefit from their technology readiness level as proof of feasibility, required in addition to proven suitability. To be implemented in operational mode at the space segment and/or midstream segment by both public and private EO big data providers, the proposed EO optical sensory image-derived semantics-enriched ARD product-pair and process reference standard is highlighted as the linchpin for success of a new notion of Space Economy 4.0.
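The first calibration stage mentioned, digital numbers to top-of-atmosphere (TOA) reflectance, can be shown concretely; the sketch below assumes the Landsat 8 style convention rho = (gain * DN + offset) / sin(sun_elevation), with placeholder coefficient values, and the remaining stages of the five-stage sequence are not shown:

    import math

    def dn_to_toa_reflectance(dn, gain, offset, sun_elevation_deg):
        # Rescale dimensionless digital numbers to TOA reflectance, then
        # correct for solar elevation (metadata supplies gain/offset).
        rho = gain * dn + offset
        return rho / math.sin(math.radians(sun_elevation_deg))

    # Toy usage with plausible Landsat 8 OLI rescaling coefficients.
    print(dn_to_toa_reflectance(12000, gain=2.0e-5, offset=-0.1, sun_elevation_deg=45.0))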
The Tibetan Plateau (TP), known as the “Third Pole” of the Earth and the “Asian Water Tower”, is a magnifier of global climate change and the birthplace of many large rivers in Asia. The TP hosts unique alpine wetlands, accounting for 20% of China's wetland area; its lakes alone constitute half of the national lake area. Wetlands, one of the three major ecosystems, are critical to human survival and development [1].