Logistic regression is often used to solve linear binary classification problems such as machine vision, speech recognition, and handwriting recognition. However, it usually fails on certain nonlinear multi-classification problems, such as problems with non-equilibrium (imbalanced) samples. Many scholars have proposed methods such as neural networks, least-squares support vector machines, and the AdaBoost meta-algorithm; these methods essentially belong to the machine learning category. In this work, based on probability theory and statistical principles, we propose an improved logistic regression algorithm based on kernel density estimation for solving nonlinear multi-classification problems. We compared our approach with other methods on non-equilibrium samples; the results show that our approach guarantees sample integrity and achieves superior classification.
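The abstract does not spell out how kernel density estimation is combined with logistic regression, so the following is only a minimal sketch of one plausible reading: fit a kernel density estimate per class, use the per-class log-densities as inputs to a multinomial logistic regression, and classify imbalanced multi-class data without resampling. The function names and toy data are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: per-class kernel density estimates feeding a multinomial
# logistic regression. Illustration of the general idea only, not the
# paper's exact algorithm.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression

def kde_logistic_fit(X, y, bandwidth=0.5):
    """Fit one KDE per class, then a logistic regression on the log-densities."""
    classes = np.unique(y)
    kdes = {c: KernelDensity(bandwidth=bandwidth).fit(X[y == c]) for c in classes}
    # Per-class log-density features: shape (n_samples, n_classes)
    F = np.column_stack([kdes[c].score_samples(X) for c in classes])
    clf = LogisticRegression(max_iter=1000).fit(F, y)
    return kdes, clf, classes

def kde_logistic_predict(kdes, clf, classes, X_new):
    F = np.column_stack([kdes[c].score_samples(X_new) for c in classes])
    return clf.predict(F)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Imbalanced three-class toy data
    X = np.vstack([rng.normal(0, 1, (200, 2)),
                   rng.normal(3, 1, (40, 2)),
                   rng.normal(-3, 1, (10, 2))])
    y = np.array([0] * 200 + [1] * 40 + [2] * 10)
    kdes, clf, classes = kde_logistic_fit(X, y)
    print(kde_logistic_predict(kdes, clf, classes, X[:5]))
```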
Webpage classification differs from traditional text classification in its irregular words and phrases and its massive, unlabeled features, which makes it harder to obtain effective features. To cope with this problem, we propose two scenarios for extracting meaningful strings, based on document clustering and term clustering, with multiple strategies to optimize a Vector Space Model (VSM) in order to improve webpage classification. The results show that document clustering works better than term clustering in coping with document content. However, a better overall performance is obtained by spectral clustering combined with document clustering. Moreover, since images coexist with document content on the same webpage, the proposed method is also applied to extract meaningful terms for images, and the experimental results likewise show its effectiveness in improving webpage classification.
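As a hedged illustration of the general pipeline (not the paper's exact multi-strategy method), the sketch below builds a TF-IDF vector space model, groups documents with spectral clustering, and takes the highest-weighted terms of each cluster as candidate "meaningful strings" for extending the feature set. The toy documents and the helper name cluster_terms are assumptions made for the example.

```python
# Hedged sketch: TF-IDF VSM + spectral clustering of documents; the top-weighted
# terms of each cluster are taken as candidate meaningful strings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering

docs = [
    "download free mp3 music player online",      # toy webpage snippets
    "music album song playlist stream online",
    "football match score league goal",
    "basketball game score playoff team",
    "laptop review cpu benchmark price",
    "smartphone review camera battery price",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()             # documents in a VSM
labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)

def cluster_terms(X, labels, vocab, top_k=3):
    """Top TF-IDF terms of each document cluster (candidate meaningful strings)."""
    out = {}
    for c in np.unique(labels):
        mean_w = X[labels == c].mean(axis=0)
        out[int(c)] = [vocab[i] for i in mean_w.argsort()[::-1][:top_k]]
    return out

print(cluster_terms(X, labels, vec.get_feature_names_out()))
```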
This article acquaints the public with the insights gained from conducting document searches in the Slovak public administration information system, when supported by knowledge of its management. Additionally, it discusses the advantages of simulating performance parameters and comparing the obtained results with the real parameters of the eZbierka (eCollection) legislation webpage. This comparison was based upon simulated results, obtained through the Gatling simulation tool, versus those obtained from measuring the properties of the public administration legislation webpage. Both sets of data (simulated and real) were generated via the document search technologies in place on the eZbierka legislation webpage. The webpage provides users with binding laws and bylaws available in an electronically signed PDF file format, and it is free and open source. To simulate the accessing of documents on the webpage, the Gatling simulation tool was used. This tool simulated the activity performed in the background of the information system as a user attempted to read the data via the steps described in the scenario. The settings of the simulated environment corresponded as closely as possible to the hardware parameters and network infrastructure properties used for the operation of the respective information system. Based on these data, through load changing, we determined the number of users, the response time to queries, and their number; these parameters define the throughput of the server of the legislation webpage. The required parameter determination and the performance of search technology operations are confirmed by a suitable hardware design and the webpage property parameter settings. We used data from the eZbierka legislation webpage from its operational period of January 2016 to January 2019 for comparison, and analysed the relevant data to determine the parameter values of the legislation webpage of the slov-lex information system. The basic elements of the design solution include the technology used, the technology for searching the legislative documents with the support of a searching tool, and a graphic database interface. By comparing the results, their dependencies, and their proportionality, it is possible to ascertain the proper determination and appropriateness of the applied search technology for the selection of documents. Further, the graphic interface of the real web database was confirmed.
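The study itself used the Gatling tool; purely as an illustration of the underlying idea (vary the number of concurrent users, record response times, and derive throughput), the sketch below uses only the Python standard library. The URL is a placeholder, not the actual eZbierka endpoint, and the loop over user counts stands in for the "load changing" described above.

```python
# Hedged sketch: measure response times and throughput for a growing number of
# concurrent simulated users. Illustrative only; the real study used Gatling,
# and the URL below is a placeholder.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "https://example.org/search?q=law"  # placeholder endpoint

def one_request(url):
    start = time.perf_counter()
    with urlopen(url, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start

def run_load(url, users, requests_per_user=5):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        times = list(pool.map(lambda _: one_request(url),
                              range(users * requests_per_user)))
    elapsed = time.perf_counter() - t0
    return statistics.mean(times), len(times) / elapsed  # avg latency, req/s

if __name__ == "__main__":
    for users in (1, 5, 10, 20):       # load changing
        avg, rps = run_load(URL, users)
        print(f"{users:>3} users: avg response {avg:.3f}s, throughput {rps:.1f} req/s")
```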
With the explosive growth of Internet information, it is increasingly important to fetch real-time and relevant information, which places higher requirements on the speed of webpage classification, one of the common methods for retrieving and managing information. To obtain a more efficient classifier, this paper proposes a webpage classification method based on locality-sensitive hash functions. It contains three innovative modules: building a feature dictionary, mapping feature vectors to fingerprints using locality-sensitive hashing, and extending webpage features. The comparison results show that the proposed algorithm achieves better performance in less time than the naive Bayes classifier.
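The paper's feature-dictionary and feature-extension modules are not detailed in the abstract; the hedged sketch below only illustrates the fingerprinting step with a simhash-style locality-sensitive hash, where weighted token hashes are folded into a 64-bit fingerprint and pages are compared by Hamming distance. Function names and toy pages are assumptions.

```python
# Hedged sketch: simhash-style locality-sensitive fingerprint of a weighted
# token set, compared by Hamming distance. Illustrative, not the paper's
# exact modules.
import hashlib
from collections import Counter

def simhash(tokens, bits=64):
    """Fold weighted 64-bit token hashes into one locality-sensitive fingerprint."""
    acc = [0] * bits
    for token, weight in Counter(tokens).items():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, v in enumerate(acc) if v > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

page_a = "cheap flight ticket booking online travel deal".split()
page_b = "cheap flight booking online travel discount deal".split()
page_c = "python machine learning tutorial code example".split()

fa, fb, fc = simhash(page_a), simhash(page_b), simhash(page_c)
print(hamming(fa, fb), hamming(fa, fc))  # similar pages -> smaller distance
```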
Webpage keyword extraction is very important for automatic webpage summarization, retrieval, automatic question answering, and character relation extraction, among other tasks. In this paper, the environment vector of words is constructed from lexical chains, word context, word frequency, and webpage attribute weights according to the characteristics of keywords. A multi-factor table of words is thus constructed, and the keyword extraction problem is then divided into two classes according to this table: keyword and non-keyword. Words are then classified with a support vector machine (SVM); this method can extract keywords from unregistered words and eliminate semantic ambiguities. Experimental results show that this method achieves higher precision and recall than the simple tf/idf algorithm.
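The exact multi-factor table is not reproduced in the abstract; the following is a small hedged sketch of the final step only: each candidate word is represented by a few numeric factors (term frequency, a position weight, an in-title flag — all illustrative stand-ins) and an SVM separates keywords from non-keywords.

```python
# Hedged sketch: candidate words as small multi-factor vectors, classified as
# keyword / non-keyword with an SVM. Factor values are made up for illustration.
import numpy as np
from sklearn.svm import SVC

# Columns: [term frequency, position weight, appears-in-title flag]
X_train = np.array([
    [0.12, 0.9, 1],   # keyword-like words
    [0.10, 0.8, 1],
    [0.08, 0.7, 0],
    [0.01, 0.1, 0],   # non-keyword-like words
    [0.02, 0.2, 0],
    [0.01, 0.3, 0],
])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = keyword, 0 = non-keyword

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

X_new = np.array([[0.09, 0.85, 1],      # unseen (possibly unregistered) words
                  [0.015, 0.15, 0]])
print(clf.predict(X_new))               # expected: [1 0]
```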
Microseismic monitoring technology is widely used in tunnel and coal mine safety production. For signals generated by ultra-weak microseismic events, traditional sensors encounter limitations in detection sensitivity. Given the complex engineering environment, automatic multi-class classification of microseismic data is highly desirable. In this study, we use acceleration sensors to collect signals and combine an improved Visual Geometry Group (VGG) network with a convolutional block attention module to obtain a new network structure, termed CNN_BAM, for automatic classification and identification of microseismic events. We use the dataset collected from the Hanjiang-to-Weihe River Diversion Project to train and validate the network model. Results show that the CNN_BAM model exhibits good feature extraction ability, achieving a recognition accuracy of 99.29% and surpassing all its counterparts. The stability and accuracy of the classification algorithm improve remarkably. In addition, through fine-tuning and migration to the Pan Ⅱ Mine Project, the network demonstrates reliable generalization performance. This outcome reflects its adaptability across different projects and its promising application prospects.
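The abstract names a convolutional block attention module (CBAM) attached to an improved VGG backbone; the sketch below shows a standard CBAM block (channel attention followed by spatial attention) in PyTorch, as a hedged illustration of the attention component rather than the authors' full CNN_BAM network.

```python
# Hedged sketch: a standard CBAM block (channel + spatial attention), the kind
# of module the abstract attaches to a VGG-style backbone. Not the full CNN_BAM.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                 # global average pool
        mx = self.mlp(x.amax(dim=(2, 3)))                  # global max pool
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                  # channel-wise average
        mx, _ = x.max(dim=1, keepdim=True)                 # channel-wise max
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

feat = torch.randn(2, 64, 32, 32)    # e.g. a VGG feature map
print(CBAM(64)(feat).shape)          # torch.Size([2, 64, 32, 32])
```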
In the evolving landscape of cyber threats, phishing attacks pose significant challenges, particularly through deceptive webpages designed to extract sensitive information under the guise of legitimacy. Conventional and machine learning (ML)-based detection systems struggle to detect phishing websites owing to their constantly changing tactics. Furthermore, newer phishing websites exhibit subtle and expertly concealed indicators that are not readily detectable. Hence, effective detection depends on identifying the most critical features. Traditional feature selection (FS) methods often fail to enhance ML model performance and instead decrease it. To combat these issues, we propose an innovative method that uses explainable AI (XAI) to enhance FS in ML models and improve the identification of phishing websites. Specifically, we employ SHapley Additive exPlanations (SHAP) for a global perspective and aggregated Local Interpretable Model-agnostic Explanations (LIME) to determine specific localized patterns. The proposed SHAP- and LIME-aggregated FS (SLA-FS) framework pinpoints the most informative features, enabling more precise, swift, and adaptable phishing detection. Applying this approach to an up-to-date web phishing dataset, we evaluate the performance of three ML models before and after FS to assess their effectiveness. Our findings reveal that random forest (RF), with an accuracy of 97.41%, and XGBoost (XGB), at 97.21%, significantly benefit from the SLA-FS framework, while k-nearest neighbors lags. Our framework increases the accuracy of RF and XGB by 0.65% and 0.41%, respectively, outperforming traditional filter or wrapper methods and any prior methods evaluated on this dataset, showcasing its potential.
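The full SLA-FS framework aggregates both SHAP and LIME explanations; the hedged sketch below shows only the SHAP half of that idea: rank features by mean absolute SHAP value from a trained random forest and retrain on the top-k features. The synthetic dataset, the cut-off k, and the shape handling of the SHAP output are assumptions for illustration.

```python
# Hedged sketch: SHAP-driven feature selection for a random forest (only the
# SHAP half of the SLA-FS idea; LIME aggregation is omitted). Illustrative.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Mean |SHAP| per feature as a global importance score.
sv = shap.TreeExplainer(rf).shap_values(X_tr)
sv = sv[1] if isinstance(sv, list) else sv       # binary case: take class-1 values
if sv.ndim == 3:                                 # some shap versions return (n, f, c)
    sv = sv[:, :, 1]
importance = np.abs(sv).mean(axis=0)

k = 8                                            # illustrative cut-off
top = np.argsort(importance)[::-1][:k]
rf_fs = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr[:, top], y_tr)

print("all features :", rf.score(X_te, y_te))
print(f"top-{k} (SHAP):", rf_fs.score(X_te[:, top], y_te))
```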
With the rapid development of Internet of Things technology, the sharp increase in network devices and their inherent security vulnerabilities present a stark contrast, bringing unprecedented challenges to the field of network security, especially in identifying malicious attacks. However, due to the uneven distribution of network traffic data, particularly the imbalance between attack traffic and normal traffic, as well as the imbalance between minority-class and majority-class attacks, traditional machine learning detection algorithms have significant limitations when dealing with sparse network traffic data. To effectively tackle this challenge, we have designed a lightweight intrusion detection model based on diffusion mechanisms, named Diff-IDS, with the core objective of enhancing the model's efficiency in parsing complex network traffic features, thereby significantly improving its detection speed and training efficiency. The model begins by finely filtering network traffic features and converting them into grayscale images, while also employing image-flipping techniques for data augmentation. Subsequently, these preprocessed images are fed into a diffusion model based on the Unet architecture for training. Once the model is trained, we fix the weights of the Unet network and propose a feature enhancement algorithm based on feature masking to further boost the model's expressiveness. Finally, we devise an end-to-end lightweight detection strategy to streamline the model, enabling efficient lightweight detection of imbalanced samples. Our method has been subjected to multiple experimental tests on renowned network intrusion detection benchmarks, including CICIDS 2017, KDD 99, and NSL-KDD. The experimental results indicate that Diff-IDS leads the current state-of-the-art models in detection accuracy, training efficiency, and lightweight metrics, demonstrating exceptional detection capabilities and robustness.
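Of the pipeline described above, only the preprocessing step is simple enough to sketch from the abstract: scale each traffic-feature vector to 0-255, reshape (with zero padding) into a small grayscale image, and augment by horizontal and vertical flipping. The 8x8 image size and the padding scheme are assumptions; the Unet-based diffusion training and feature-masking stages are not shown.

```python
# Hedged sketch: turn a tabular traffic-feature vector into a small grayscale
# image and augment it by flipping, as in the preprocessing the abstract
# describes. Image size and zero padding are assumptions; the diffusion model
# itself is not shown.
import numpy as np

def features_to_image(vec, side=8):
    """Min-max scale a feature vector to [0, 255] and pad into a side x side image."""
    vec = np.asarray(vec, dtype=np.float64)
    lo, hi = vec.min(), vec.max()
    scaled = np.zeros_like(vec) if hi == lo else (vec - lo) / (hi - lo) * 255.0
    img = np.zeros(side * side)
    img[: min(len(scaled), side * side)] = scaled[: side * side]
    return img.reshape(side, side).astype(np.uint8)

def flip_augment(img):
    """Return the original image plus its horizontal and vertical flips."""
    return [img, np.fliplr(img), np.flipud(img)]

rng = np.random.default_rng(0)
flow_features = rng.random(41)            # e.g. a 41-dimensional NSL-KDD-style record
imgs = flip_augment(features_to_image(flow_features))
print([im.shape for im in imgs])          # [(8, 8), (8, 8), (8, 8)]
```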
Funding (improved logistic regression with kernel density estimation): The authors would like to thank all anonymous reviewers for their suggestions and feedback. This work was supported by the National Natural Science Foundation of China (Grant No. 61379103).
Funding (webpage classification via document and term clustering): Supported by the National Natural Science Foundation of China under Grants No. 61100205 and No. 60873001, the Hi-Tech Research and Development Program of China under Grant No. 2011AA010705, and the Fundamental Research Funds for the Central Universities under Grant No. 2009RC0212.
Funding (microseismic event classification with CNN_BAM): Supported by the Key Research and Development Plan of Anhui Province (202104a05020059) and the Excellent Scientific Research and Innovation Team of Anhui Province (2022AH010003); support from the Hefei Comprehensive National Science Center is highly appreciated.
Funding (Diff-IDS intrusion detection): Supported by the Key Research and Development Program of Hainan Province (Grant Nos. ZDYF2024GXJS014 and ZDYF2023GXJS163), the National Natural Science Foundation of China (NSFC) (Grant Nos. 62162022 and 62162024), and the Collaborative Innovation Project of Hainan University (XTCX2022XXB02).