Funding: supported by the Natural Science Foundation of Guangdong Province (Grant 2023A1515011667), the Science and Technology Major Project of Shenzhen (Grant KJZD20230923114809020), and the Key Basic Research Foundation of Shenzhen (Grant JCYJ20220818100205012).
Abstract: Estimating probability density functions (PDFs) is critical in data analysis, particularly for complex multimodal distributions. Traditional kernel density estimator (KDE) methods often face challenges in accurately capturing multimodal structures due to their uniform weighting scheme, leading to mode loss and degraded estimation accuracy. This paper presents the flexible kernel density estimator (F-KDE), a novel nonparametric approach designed to address these limitations. F-KDE introduces the concept of kernel unit inequivalence, assigning adaptive weights to each kernel unit, which better models local density variations in multimodal data. The method optimises an objective function that integrates estimation error and log-likelihood, using a particle swarm optimisation (PSO) algorithm that automatically determines optimal weights and bandwidths. Through extensive experiments on synthetic and real-world datasets, we demonstrate that (1) the weights and bandwidths in F-KDE stabilise as the optimisation algorithm iterates, (2) F-KDE effectively captures multimodal characteristics, and (3) F-KDE outperforms state-of-the-art density estimation methods in terms of accuracy and robustness. The results confirm that F-KDE provides a valuable solution for accurately estimating multimodal PDFs.
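The central idea — per-kernel weights and bandwidths in place of the uniform scheme — can be sketched as a weighted Gaussian KDE. The weights below are hand-set for illustration only; the actual F-KDE learns weights and bandwidths jointly via PSO, which is not reproduced here.

```python
import numpy as np

def weighted_kde(x_eval, samples, weights, bandwidths):
    """Gaussian KDE where each kernel unit i has its own weight w_i and
    bandwidth h_i, instead of the uniform 1/n weighting of standard KDE."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalise so the estimate integrates to 1
    x = np.atleast_1d(x_eval)[:, None]     # (m, 1) evaluation points
    s = np.asarray(samples)[None, :]       # (1, n) kernel centres
    h = np.asarray(bandwidths)[None, :]    # (1, n) per-kernel bandwidths
    k = np.exp(-0.5 * ((x - s) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return (k * w[None, :]).sum(axis=1)

# Bimodal toy data: upweighting kernels in the sparser mode keeps it visible.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 0.5, 80), rng.normal(3, 0.5, 20)])
w = np.where(data > 0, 4.0, 1.0)           # hand-set "adaptive" weights
grid = np.linspace(-6, 6, 241)
dens = weighted_kde(grid, data, w, np.full(data.size, 0.4))
```

Upweighting the sparser right-hand mode prevents it from being flattened away, which is the mode-loss failure the abstract attributes to uniform weighting.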
Funding: the National Natural Science Foundation of China under Grant No. 62272087 and the Science and Technology Planning Project of Sichuan Province under Grant No. 2023YFG0161.
Abstract: Long-term urban traffic flow prediction is an important task in the field of intelligent transportation, as it can help optimize traffic management and improve travel efficiency. To improve prediction accuracy, a crucial issue is how to model spatiotemporal dependency in urban traffic data. In recent years, many studies have adopted spatiotemporal neural networks to extract key information from traffic data. However, most models ignore the semantic spatial similarity between long-distance areas when mining spatial dependency. They also ignore the impact of already-predicted time steps on the next unpredicted time step when making long-term predictions. Moreover, these models lack a comprehensive data embedding process to represent complex spatiotemporal dependency. This paper proposes a multi-scale persistent spatiotemporal transformer (MSPSTT) model to perform accurate long-term traffic flow prediction in cities. To address these issues, MSPSTT adopts an encoder-decoder structure and incorporates temporal, periodic, and spatial features to fully embed urban traffic data. The model consists of a spatiotemporal encoder and a spatiotemporal decoder, which rely on temporal, geospatial, and semantic-space multi-head attention modules to dynamically extract temporal, geospatial, and semantic characteristics. The spatiotemporal decoder combines the context information provided by the encoder, integrates the predicted time step information, and is iteratively updated to learn the correlation between different time steps over a broader time range, improving the model's accuracy for long-term prediction. Experiments on four public transportation datasets demonstrate that MSPSTT outperforms existing models by up to 9.5% on three common metrics.
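The multi-head attention building block that such spatiotemporal transformers rely on can be sketched generically. This is plain scaled dot-product self-attention with random projections standing in for learned weights, not the MSPSTT architecture itself:

```python
import numpy as np

def multi_head_attention(x, num_heads, rng):
    """Scaled dot-product multi-head self-attention (NumPy sketch).
    x: (seq_len, d_model). Random projections stand in for learned weights."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                      for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(d_head)          # pairwise similarities
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)    # softmax over key positions
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)           # (seq_len, d_model)

rng = np.random.default_rng(1)
x = rng.standard_normal((12, 16))   # e.g. 12 time steps, 16-dim traffic embedding
y = multi_head_attention(x, num_heads=4, rng=rng)
```

In MSPSTT, three such modules attend over different axes (time steps, geographic neighbours, semantically similar areas); the mechanism itself is identical.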
Abstract: This paper describes a computational model for the implementation of causal learning in cognitive agents. The Conscious Emotional Learning Tutoring System (CELTS) is able to provide dynamic, fine-tuned assistance to users. The integration of a causal learning mechanism within CELTS allows it to first establish, through a mix of data mining algorithms, gross user group models. CELTS then uses these models to find the causes of users' mistakes, evaluate their performance, predict their future behavior, and, through a pedagogical knowledge mechanism, decide which tutoring intervention fits best.
Funding: supported by the Natural Science Foundation of Guangdong Province (No. 2023A1515011667), the Science and Technology Major Project of Shenzhen (No. KJZD20230923114809020), the Key Basic Research Foundation of Shenzhen (No. JCYJ20220818100205012), the Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515120020), the Shenzhen Science and Technology Program (No. RCBS20221008093331068), and the Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone Project (No. HZQSWS-KCCYB-2024016).
Abstract: The Thresholding Bandit (TB) problem is a popular sequential decision-making problem which aims at identifying the systems whose means are greater than a threshold. Instead of working on the upper bound of a loss function, our approach stands out from conventional practices by directly minimizing the loss itself. Leveraging large deviation theory, we first provide an asymptotically optimal allocation rule for the TB problem, and then propose a parameter-free Large Deviation (LD) algorithm to make the allocation rule implementable. A central limit theorem-based Large Deviation (CLD) algorithm is further proposed as a supplement to improve computational efficiency using a normal approximation. Extensive experiments are conducted to validate the superiority of our algorithms over existing methods, and to demonstrate their broader applicability to more general distributions and various kinds of loss functions.
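A minimal version of the TB setup, assuming Gaussian rewards and a naive uniform (round-robin) allocation rather than the paper's LD/CLD allocation rules:

```python
import random

def thresholding_bandit(arm_means, threshold, budget, seed=0):
    """Uniform-allocation TB baseline: pull arms round-robin, then flag
    each arm whose empirical mean exceeds the threshold. The LD/CLD
    algorithms replace the round-robin step with an adaptive allocation."""
    rng = random.Random(seed)
    n = len(arm_means)
    sums = [0.0] * n
    counts = [0] * n
    for t in range(budget):
        i = t % n                                 # round-robin allocation
        sums[i] += rng.gauss(arm_means[i], 1.0)   # noisy unit-variance reward
        counts[i] += 1
    return [sums[i] / counts[i] > threshold for i in range(n)]

# Four systems; the goal is to identify those with mean above 1.0.
above = thresholding_bandit([0.2, 0.7, 1.5, 2.3], threshold=1.0, budget=4000)
```

An adaptive rule would concentrate pulls on arms whose means sit close to the threshold, since those dominate the misclassification loss; the sketch only shows the problem structure being optimized.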
Funding: the National Natural Science Foundation of China (Grant Nos. 61133005, 61432005, 61370095, 61472124, 61202109, and 61472126) and the International Science and Technology Cooperation Program of China (2015DFA11240 and 2014DFBS0010).
Abstract: High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time-consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), concise and lossless representations of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task. This is because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm utilizes the utility-list structure to directly compute the utilities of itemsets without producing candidates. It also introduces three novel strategies to reduce the search space, namely chain-estimated utility co-occurrence pruning, lower branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required for discovering CHUIs. To evaluate the performance of the proposed algorithm and its novel strategies, extensive experiments have been conducted on six benchmark datasets having various characteristics. Results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithm outperforms the current state-of-the-art CHUD and CHUI-Miner algorithms, and that CLS-Miner scales linearly.
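The utility measure that HUIM algorithms compute can be shown on a toy transaction database (the item names, quantities, and unit profits below are invented). This brute-force scan is exactly what utility-list structures are designed to avoid:

```python
# Toy transaction database: quantities per item, plus unit profits.
profits = {"a": 5, "b": 2, "c": 1}
transactions = [
    {"a": 1, "b": 2},   # utility of {a, b} here: 1*5 + 2*2 = 9
    {"a": 2, "c": 3},
    {"b": 1, "c": 4},
]

def utility(itemset, db, unit_profit):
    """Utility of an itemset: over transactions containing every item of
    the itemset, sum quantity * unit profit of those items."""
    total = 0
    for tx in db:
        if all(item in tx for item in itemset):
            total += sum(tx[item] * unit_profit[item] for item in itemset)
    return total

# Brute-force high-utility check against a minimum utility of 9.
candidates = [("a",), ("b",), ("c",), ("a", "b")]
high_utility = [s for s in candidates if utility(s, transactions, profits) >= 9]
```

A closed HUI is one with no superset having the same support, which is why mining CHUIs losslessly compresses the (often huge) set of all HUIs.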
Funding: Project supported by the National Natural Science Foundation of China (No. 61972261), the Natural Science Foundation of Guangdong Province, China (No. 2023A1515011667), the Key Basic Research Foundation of Shenzhen, China (No. JCYJ20220818100205012), and the Basic Research Foundation of Shenzhen, China (No. JCYJ20210324093609026).
Abstract: The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and has received several enhancements over the past 20 years. SMOTE and its variants synthesize a number of minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between different classes, which further complicates classifier training. To address this issue, this paper proposes a novel generalization-oriented rather than imputation-oriented minority-class sample point generation algorithm, named overlapping minimization SMOTE (OM-SMOTE). This algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. Then, OM-SMOTE employs a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from overlapping areas between classes. Extensive experiments have been conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. Results show that using OM-SMOTE to generate synthetic minority-class sample points leads to better classifier training performance for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers than 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is shared publicly on the GitHub platform at https://github.com/luxuan123123/OM-SMOTE/.
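For context, classic SMOTE interpolation — the baseline that OM-SMOTE improves on, not OM-SMOTE itself — can be sketched as follows:

```python
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Classic SMOTE: pick a minority point, pick one of its k nearest
    minority neighbours, and synthesize a point on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()          # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Tiny 2-D minority class; synthetic points stay inside its coordinate bounds.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote(minority)
```

Because the interpolation happens in the original space, synthetic points can land in the class-overlap region when the classes are not well separated, which is precisely the failure mode that motivates OM-SMOTE's remapping step.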
Abstract: Disinformation, often known as fake news, is a major issue that has received a lot of attention lately. Many researchers have proposed effective means of detecting and addressing it. Current machine learning and deep learning based methodologies for the classification/detection of fake news are content-based, network (propagation) based, or multimodal methods that combine textual and visual information. We introduce here a framework, called FNACSPM, based on sequential pattern mining (SPM), for fake news analysis and classification. In this framework, six publicly available datasets, containing a diverse range of fake and real news, and their combination, are first transformed into a proper format. Then, SPM algorithms are applied to the transformed datasets to extract frequent patterns (and rules) of words, phrases, or linguistic features. The obtained patterns capture distinctive characteristics associated with fake or real news content, providing valuable insights into the underlying structures and commonalities of misinformation. Subsequently, the discovered frequent patterns are used as features for fake news classification. The framework is evaluated with eight classifiers, and their performance is assessed with various metrics. Extensive experiments show that FNACSPM outperforms other state-of-the-art approaches for fake news classification, and that it expedites the classification task with high accuracy.
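A toy stand-in for the pattern-to-feature step: count ordered word pairs (length-2 sequential patterns) per document, keep those meeting a minimum support, then encode each document by which frequent patterns it contains. Real SPM algorithms such as PrefixSpan avoid this brute-force enumeration:

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(docs, min_support=2):
    """Length-2 word patterns that preserve relative order within a
    document, kept if they occur in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        counts.update(set(combinations(doc.lower().split(), 2)))  # one count per doc
    return {p: c for p, c in counts.items() if c >= min_support}

# Invented miniature corpus, purely for illustration.
docs = ["breaking shocking news today",
        "shocking news spreads fast",
        "city council meets today"]
patterns = frequent_patterns(docs)

# Encode each document as a binary vector over the discovered patterns,
# ready to feed to any classifier.
features = [[int(p in set(combinations(d.lower().split(), 2)))
             for p in sorted(patterns)]
            for d in docs]
```

FNACSPM's pipeline follows the same shape at scale: mine frequent sequential patterns from the transformed news datasets, then use pattern membership as the feature representation for the eight evaluated classifiers.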
Funding: This paper was supported by the National Natural Science Foundation of China (Grant No. 61972261), the Natural Science Foundation of Guangdong Province (No. 2023A1515011667), the Key Basic Research Foundation of Shenzhen (No. JCYJ20220818100205012), and the Basic Research Foundation of Shenzhen (No. JCYJ20210324093609026).
Abstract: Random sample partition (RSP) is a newly developed big data representation and management model for big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge in implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information in RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated from the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments is conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with a sample size determined by DEM yields a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to the sample size determination problem for big data RSP implementation.
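The univariate DKW inequality gives a theoretical sample size in closed form; when a multivariate-style bound leaves n implicit, fixed-point iteration applies. The multivariate bound shape used below is an assumed illustrative form, not the paper's exact inequality:

```python
import math

def dkw_sample_size(epsilon, alpha):
    """Univariate DKW bound P(sup|F_n - F| > eps) <= 2 exp(-2 n eps^2):
    solving 2 exp(-2 n eps^2) <= alpha gives n >= ln(2/alpha) / (2 eps^2)."""
    return math.ceil(math.log(2 / alpha) / (2 * epsilon ** 2))

def fixed_point_sample_size(g, n0, tol=0.5, max_iter=100):
    """Fixed-point iteration n_{k+1} = g(n_k), used when the bound
    leaves the required sample size n implicit."""
    n = float(n0)
    for _ in range(max_iter):
        n_next = g(n)
        if abs(n_next - n) < tol:
            n = n_next
            break
        n = n_next
    return math.ceil(n)

# ASSUMED illustrative multivariate-style bound 2(n+1)^d exp(-2 n eps^2) <= alpha,
# which rearranges to the implicit form n = g(n):
d, eps, alpha = 3, 0.05, 0.05
g = lambda n: (math.log(2 / alpha) + d * math.log(n + 1)) / (2 * eps ** 2)
n_multi = fixed_point_sample_size(g, n0=dkw_sample_size(eps, alpha))
```

The iteration converges here because g changes slowly in n (its derivative shrinks like d/(n+1)), mirroring the convergence result the abstract reports for the FPI iteration function.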