Purpose: With more and more digital collections of various information resources becoming available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC. Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels). Findings: Evaluation shows that Support Vector Machine with linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes where the characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine, but come close, with the benefit of a smaller representation size. Analysis of feature impact in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available, and 66.13% when too few records (often fewer than 100 per class) are available for training; these figures hold only for the top three hierarchical levels (803 instead of 14,413 classes). Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results, so that they hold in experimental conditions but barely for end users in operational retrieval systems. Practical implications: In conclusion, for operational information retrieval systems, applying purely automatic DDC does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using the string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way of the future. Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to the lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.
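As a rough illustration of the best-performing configuration reported above (not the authors' code), a linear-kernel SVM over TF-IDF features of titles and keywords can be sketched in scikit-learn; the records and DDC labels below are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical records: each is "title + keywords" text paired with a
# top-three-level DDC class; the real training data came from Swedish catalogs.
texts = [
    "library cataloguing subject indexing metadata",
    "quantum mechanics particle physics experiments",
    "medieval swedish history sources archives",
]
labels = ["020", "530", "948"]  # illustrative DDC classes

# Linear-kernel SVM over TF-IDF features, the setup that won on average.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), LinearSVC())
model.fit(texts, labels)
print(model.predict(["particle accelerator measurement physics"]))
```

A linear kernel is the usual choice for sparse, high-dimensional text features, which is consistent with it outperforming the other learners here.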
This paper discusses the problem of classifying a multivariate Gaussian random field observation into one of several categories specified by different parametric mean models. Investigation is conducted on the classifier based on the plug-in Bayes classification rule (PBCR), formed by replacing the unknown parameters in the Bayes classification rule (BCR) with category parameter estimators. This extends previous work from the two-category case to the multi-category case. Novel closed-form expressions for the Bayes classification probability and the actual correct classification rate associated with the PBCR are derived. These correct classification rates are suggested as performance measures for the classification procedure. An empirical study has been carried out to analyze the dependence of the derived classification rates on the category parameters.
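In generic notation (class priors and a common covariance are assumed here for compactness; the paper's exact parametric mean models are not reproduced), the two rules can be written as:

```latex
% Bayes classification rule (BCR), with Gaussian density \varphi and
% L candidate mean models \mu_1,\dots,\mu_L:
W_B(Z) \;=\; \arg\max_{1 \le l \le L} \; \pi_l \,\varphi\!\left(Z;\,\mu_l,\,\Sigma\right)

% Plug-in Bayes classification rule (PBCR): the same rule with the
% unknown parameters replaced by their estimators:
W_{PB}(Z) \;=\; \arg\max_{1 \le l \le L} \; \hat{\pi}_l \,\varphi\!\left(Z;\,\hat{\mu}_l,\,\hat{\Sigma}\right)
```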
The sharp increase in the amount of Internet Chinese text data has significantly prolonged the processing time of classification on these data. In order to solve this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, a parallel in-memory computing platform for big data. The algorithm implements parallel operation throughout the entire training and prediction process of the naive Bayes classifier, mainly by adopting the programming model of resilient distributed datasets (RDDs). For comparison, a PNBA based on Hadoop is also implemented. The test results show that in the same computing environment and for the same text sets, the Spark PNBA is clearly superior to the Hadoop PNBA in terms of key indicators such as speedup ratio and scalability. Therefore, Spark-based parallel algorithms can better meet the requirements of large-scale Chinese text data mining.
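A minimal sketch of the idea in today's PySpark DataFrame API (the paper itself works at the lower RDD level, and real input would be segmented Chinese text):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("pnba-sketch").getOrCreate()

# Tiny illustrative corpus; two classes, label column as doubles.
train = spark.createDataFrame(
    [("sports match score win", 0.0), ("stock market price fund", 1.0)],
    ["text", "label"],
)

# Tokenize, hash into term-frequency vectors, fit multinomial naive Bayes.
# Every stage executes as distributed transformations over the cluster.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    NaiveBayes(modelType="multinomial"),
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show()
```

Keeping both training and prediction inside distributed transformations is what lets Spark avoid the disk-bound passes of a Hadoop MapReduce implementation.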
A method for automatically constructing an effective domain ontology is proposed in this paper. The main concept is using Formal Concept Analysis to automatically establish the domain ontology. The ontology then serves as the basis for a Naive Bayes classifier, in order to verify the effectiveness of the domain ontology for document classification. 1,752 documents divided into 10 categories are used to assess the effectiveness of the ontology, with 1,252 training documents and 500 testing documents. The F1-measure is the assessment criterion, and the following three results are obtained. The average recall of the Naive Bayes classifier is 0.94; in terms of recall, therefore, the performance based on the automatically constructed ontology is excellent. The average precision is 0.81; in terms of precision, the performance is good. The average F1-measure over the 10 categories is 0.86, so the performance of the Naive Bayes classifier based on the automatically constructed ontology is effective in terms of F1-measure. Thus, the automatically constructed domain ontology can indeed serve as the set of document categories and achieve effective document classification.
Roman Urdu has been used for text messaging over the Internet for years, especially in the Indo-Pak subcontinent. People from the subcontinent may speak the same Urdu language but use different scripts for writing. Communication using Roman characters to write Urdu on social media is now the most typical standard of communication in the Indian subcontinent, which makes it a rich information source. English text classification is a solved problem, but there have been only a few efforts to examine the rich information supply of Roman Urdu in the past. This is due to the numerous complexities involved in processing Roman Urdu data, including the non-availability of a tagged corpus, the lack of a set of rules, and the lack of standardized spellings. A large amount of Roman Urdu news data is available on mainstream news websites and on social media websites like Facebook and Twitter, but meaningful information can only be extracted if the data is in a structured format. We have developed a Roman Urdu news headline classifier, which will help to classify news into relevant categories on which further analysis and modeling can be done. This research aims to develop a Roman Urdu news classifier that classifies news into five categories (health, business, technology, sports, international). First, we develop the news dataset using scraping tools; then, after preprocessing, we compare the results of different machine learning algorithms: Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). After this, we use a phonetic algorithm to control lexical variation and test news from different websites. The preliminary results suggest that more accurate classification can be accomplished by monitoring noise inside the data while classifying the news. After applying the machine learning algorithms mentioned above, results show that the Multinomial Naïve Bayes classifier gives the best accuracy, 90.17%, once the noisy lexical variation is accounted for.
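The classical-ML baseline in such a pipeline is easy to sketch with scikit-learn; the headlines, labels, and spellings below are invented, and the paper's phonetic normalization step is omitted:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative Roman Urdu headlines (real data has highly variable spellings).
headlines = [
    "pakistan ne match jeet liya",        # sports
    "dollar ki qeemat mein izafa",        # business
    "naya mobile phone launch ho gaya",   # technology
]
labels = ["sports", "business", "technology"]

# Bag-of-words counts feeding a multinomial naive Bayes classifier,
# the best performer reported in the abstract.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(headlines, labels)
print(clf.predict(["match ka score barabar raha"]))
```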
This paper proposes an improved Naïve Bayes classifier for sentiment analysis on a large-scale dataset such as YouTube. YouTube contains large volumes of unstructured and unorganized comments and reactions, which carry important information. Organizing large amounts of data and extracting useful information is a challenging task; the extracted information can be considered new knowledge and used for decision-making. We extract comments on YouTube videos, categorize them by domain, and then apply the Naïve Bayes classifier with improved techniques. Our method achieved a decent 80% accuracy in classifying those comments. This experiment shows that the proposed method provides excellent adaptability for large-scale text classification.
Diabetic retinopathy (DR) is a disease with an increasing prevalence and the major cause of blindness among the working-age population. The possibility of severe vision loss can be greatly reduced by timely diagnosis and treatment. Automated screening for DR has been identified as an effective method for early DR detection, which can decrease the workload associated with manual grading as well as save diagnosis costs and time. Several studies have been carried out to develop automated detection and classification models for DR. This paper presents a new IoT- and cloud-based deep learning model for healthcare diagnosis of DR. The proposed model incorporates several processes, namely data collection, preprocessing, segmentation, feature extraction, and classification. First, IoT-based data collection takes place: the patient wears a head-mounted camera that captures the retinal fundus image and sends it to a cloud server. Then, the contrast level of the input DR image is increased in the preprocessing stage using the Contrast Limited Adaptive Histogram Equalization (CLAHE) model. Next, the preprocessed image is segmented using the Adaptive Spatial Kernel distance measure-based Fuzzy C-Means clustering (ASKFCM) model. Afterwards, a deep Convolutional Neural Network (CNN) based on the Inception v4 model is applied as a feature extractor, and the resulting feature vectors undergo classification with the Gaussian Naive Bayes (GNB) model. The proposed model was tested on the benchmark MESSIDOR DR image dataset, and the obtained results showed superior performance of the proposed model over the other models compared in the study.
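Two of the pipeline stages are standard enough to sketch: CLAHE contrast enhancement (OpenCV) and Gaussian NB classification of feature vectors (scikit-learn). The image and the 1536-dimensional "Inception v4" features below are random stand-ins, not real fundus data:

```python
import numpy as np
import cv2
from sklearn.naive_bayes import GaussianNB

# --- Preprocessing: CLAHE on a (synthetic) grayscale fundus image ---
rng = np.random.default_rng(0)
fundus = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(fundus)

# --- Classification: Gaussian NB over (stand-in) deep feature vectors ---
# The paper extracts Inception v4 features; random 1536-d vectors stand in.
features = rng.normal(size=(20, 1536))
grades = rng.integers(0, 2, size=20)   # illustrative binary DR labels
gnb = GaussianNB().fit(features, grades)
print(gnb.predict(features[:3]))
```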
There is growing interest in using ecosystem services to aid the development of management strategies that target sustainability and enhance ecosystem support to humans. Challenges remain in the search for methods and indicators that can quantify ecosystem services using metrics that are meaningful in light of their high priorities. We developed a framework to link ecosystems to human wellbeing based on a stepwise approach. We evaluated prospective models in terms of their capacity to quantify national ecosystem services of forests, and the most applicable models were subsequently used to quantify ecosystem services. The Korea Forest Research Institute model satisfied all criteria in its first practical use. A total of 12 key ecosystem services were identified. For our case study, we quantified four ecosystem functions: water storage capacity in forest soil for the water storage service, reduced suspended sediment for the water purification service, reduced soil erosion for the landslide prevention service, and reduced sediment yield for the sediment regulation service. Water storage capacity in forest soil was estimated at 2,142 t/ha, and reduced suspended sediment at 608 kg/ha. Reduced soil erosion was estimated at 77 m³/ha, and reduced sediment yield at 285 m³/ha. These results were similar to those reported in previous studies. Mapped results revealed hotspots of ecosystem services around protected areas that were particularly rich in biodiversity. In addition, the proposed framework illustrated that quantification of ecosystem services could be supported by the spatial flow of ecosystem services. However, our approach did not address the challenges faced when quantifying connections between ecosystem indicators and the actual benefits of the services described.
The maximum of k numerical functions f_1, ..., f_k defined on R^d, d >= 1, given by f(x) = max(f_1(x), ..., f_k(x)) for x in R^d, is used here in statistical classification. It has previously been used in statistical discrimination [1] and in clustering [2]. We first present some theoretical results on this function, and then its application in classification using a computer program we have developed. This approach leads to clear decisions, even in cases where the extension to several classes of Fisher's linear discriminant function fails to be effective.
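Written out (a reconstruction in general notation; the specific per-class functions f_i are whatever the method fits, e.g. linear discriminant scores):

```latex
f(x) \;=\; \max_{1 \le i \le k} f_i(x), \qquad x \in \mathbb{R}^d,
% and the induced classification rule allocates x to the class
% attaining the maximum:
\hat{c}(x) \;=\; \arg\max_{1 \le i \le k} f_i(x).
```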
Many businesses, such as credit card companies, insurance companies, banks, and the retail industry, rely on direct marketing. Data mining can help those institutions set marketing goals; data mining techniques improve the targeting of their audiences and the likelihood of response. In this work we investigate two data mining techniques: the Naive Bayes and the C4.5 decision tree algorithms. The goal of this work is to predict whether a client will subscribe to a term deposit. We also make a comparative study of the performance of the two algorithms. Publicly available UCI data is used to train and test the algorithms. In addition, we extract actionable knowledge from the decision tree that supports interesting and important business decisions.
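A hedged sketch of such a comparison with scikit-learn (its tree learner is CART rather than C4.5, and synthetic data stands in for the UCI bank-marketing set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank-marketing features and subscribe/no label.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit both learners and compare held-out accuracy.
for name, clf in [("naive Bayes", GaussianNB()),
                  ("decision tree", DecisionTreeClassifier(max_depth=5))]:
    score = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy {score:.3f}")
```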
In recent times, among the multitude of attacks present in network systems, DDoS attacks have emerged as those with the most devastating effects. The main objective of this paper is to propose a system that effectively detects DDoS attacks appearing in any networked system, using the clustering technique of data mining followed by classification. The method uses a Heuristics Clustering Algorithm (HCA) to cluster the available data and Naïve Bayes (NB) classification to classify the data and detect the attacks created in the system, based on some network attributes of the data packet. The clustering algorithm is based on an unsupervised learning technique and is sometimes unable to detect some of the attack instances and a few normal instances; therefore, classification techniques are used alongside clustering to overcome this problem and to enhance accuracy. Naïve Bayes classifiers rest on very strong independence assumptions and a fairly simple construction for deriving the conditional probability of each relationship. A series of experiments was performed using "The CAIDA UCSD DDoS Attack 2007 Dataset" and the "DARPA 2000 Dataset", and the efficiency of the proposed system was tested on the following performance parameters: accuracy, detection rate, and false positive rate. The proposed system was found to have enhanced accuracy and detection rate with a low false positive rate.
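The two-stage idea can be sketched generically; k-means stands in for the paper's HCA, and the per-packet features and labels below are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Synthetic per-packet features standing in for the CAIDA/DARPA attributes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(4, 1, (50, 4))])
y = np.array([0] * 200 + [1] * 50)     # 0 = normal traffic, 1 = attack

# Stage 1: unsupervised clustering groups similar traffic (k-means here,
# standing in for the heuristic clustering algorithm).
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Stage 2: NB, trained on labeled data with the cluster id appended as an
# extra feature, resolves instances the clustering alone would mislabel.
X_aug = np.column_stack([X, clusters])
nb = GaussianNB().fit(X_aug, y)
print("detection accuracy:", nb.score(X_aug, y))
```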
Arid and semiarid regions face challenges such as bushland encroachment and agricultural expansion, especially in Tiaty, Baringo, Kenya. These issues create mixed opportunities for pastoral and agro-pastoral livelihoods. Machine learning methods for land use and land cover (LULC) classification are vital for monitoring environmental changes. Advances in remote sensing increase the potential for classifying land cover, which requires assessing algorithm accuracy and efficiency for fragile environments. This research identifies the best algorithms for LULC monitoring and develops adaptive methods for sensitive ecosystems. Landsat-9 imagery from January to April 2023 facilitated land use class identification. Preprocessing in Google Earth Engine applied spectral indices such as NDVI, NDWI, BSI, and NDBI. Supervised classification used random forest (RF), support vector machine (SVM), classification and regression trees (CART), gradient boosting trees (GBT), and naïve Bayes. An accuracy assessment was used to determine the optimal classifiers for future land use analyses. The evaluation revealed that the RF model achieved 84.4% accuracy with a 0.85 weighted F1 score, indicating its effectiveness for complex LULC data. In contrast, the GBT and CART methods yielded moderate F1 scores (0.77 and 0.68), indicating overclassification and class imbalance issues. The SVM and naïve Bayes methods were less accurate, rendering them unsuitable for these LULC tasks. RF is optimal for monitoring and planning land use in dynamic arid areas. Future research should explore hybrid methods and diversify training sites to improve performance.
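Outside Earth Engine, the core classification step reduces to a random forest over per-pixel spectral-index features; a scikit-learn sketch with synthetic pixels (real work would sample NDVI/NDWI/BSI/NDBI values from the Landsat-9 composite):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic pixels: one row per sample, columns = NDVI, NDWI, BSI, NDBI.
rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(500, 4))
y = rng.integers(0, 4, size=500)   # illustrative LULC class labels

# Train a random forest and report the weighted F1, the metric used above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)
rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("weighted F1:", f1_score(y_te, pred, average="weighted"))
```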
It is known that latent semantic indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents; higher-order relations in LSI capture "latent semantics". These findings inspired a novel Bayesian framework for classification named Higher-Order Naive Bayes (HONB), introduced previously, which can explicitly make use of these higher-order relations. In this paper, we present a novel semantic smoothing method named Higher-Order Smoothing (HOS) for the Naive Bayes algorithm. HOS is built on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited. We take the concept one step further in HOS and exploit the relationships between instances of different classes. As a result, we move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. This approach improves parameter estimation when dealing with insufficient labeled data. Results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
Feature selection is an important approach to dimensionality reduction in the field of text classification. Because of the difficulty of handling the problem that selected features always contain redundant information, we propose a new, simple feature selection method which can effectively filter out redundant features. First, to calculate the relationship between two words, definitions of word-frequency-based relevance and correlative redundancy are introduced. Furthermore, an optimal feature selection (OFS) method is chosen to obtain a feature subset FS1. Finally, to improve execution speed, the redundant features in FS1 are filtered using a predetermined threshold, and the filtered features are memorized in linked lists. Experiments are carried out on three datasets (WebKB, 20-Newsgroups, and Reuters-21578) wherein support vector machines and naïve Bayes are used. The results show that the classification accuracy of the proposed method is generally higher than that of typical traditional methods (information gain, improved Gini index, and improved comprehensively measured feature selection) and the OFS methods. Moreover, the proposed method runs faster than typical mutual-information-based methods (improved and normalized mutual-information-based feature selections, and multilabel feature selection based on maximum dependency and minimum redundancy) while ensuring classification accuracy. Statistical results validate the effectiveness of the proposed method in handling redundant information in text classification.
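The general relevance-redundancy pattern the method follows can be sketched as a greedy filter; the correlation-based scores below are generic stand-ins, not the paper's word-frequency definitions:

```python
import numpy as np

def select_features(X, y, k, redundancy_threshold=0.9):
    """Greedy relevance-redundancy filter: keep the most label-relevant
    features, skipping any that nearly duplicate an already-kept one."""
    # Relevance: absolute correlation of each feature with the label.
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(-rel):  # consider the most relevant features first
        # Redundancy check against every feature kept so far.
        if all(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) < redundancy_threshold
               for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)  # a nearly redundant copy
y = (X[:, 0] + X[:, 2] > 0).astype(int)
print(select_features(X, y, k=3))   # feature 1 is filtered as redundant
```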