Journal Articles
15 articles found
1. Risk Minimization-Based Weighted Naive Bayesian Classifier (Cited: 1)
Authors: Ou Guiliang, He Yulin, Zhang Manjing, Huang Zhexue, Philippe Fournier-Viger. 《计算机科学》 (Computer Science), PKU Core, 2025, Issue 3, pp. 137-151 (15 pages).
The naive Bayesian classifier is regarded as one of the ten classic algorithms of machine learning, known for its complete theoretical foundation and simple model structure, and it achieves good classification results in many practical applications. However, the conditional attribute independence assumption limits its performance to some extent, and many improvements have been proposed to alleviate this problem, among them the weighted naive Bayesian classifier. Based on an in-depth analysis of the role of marginal probability weights, this paper proposes a Risk Minimization-Based Weighted Naive Bayesian Classifier (RM-WNBC), which considers both the empirical risk of the classifier and the structural risk of the weights when determining the weights. Unlike existing improvement strategies that focus excessively on the external generalization performance of the naive Bayesian classifier, RM-WNBC improves generalization starting from the classifier's internal probability distributions. The empirical risk measures the classification ability of the weighted classifier and is expressed by the estimation quality of the posterior probabilities; the structural risk characterizes how the weighted classifier handles attribute dependence and is expressed by the mean squared error of the class-conditional probabilities. Minimizing the empirical risk guarantees good training accuracy, while minimizing the structural risk gives RM-WNBC the best capability to express attribute dependence. To obtain the optimal weights, an efficient and convergent weight update strategy is derived. The feasibility, rationality, and effectiveness of RM-WNBC are validated on 31 standard classification datasets from UCI and KEEL. Experimental results show that: (1) the training and testing accuracies of RM-WNBC increase gradually until convergence as the marginal probability weights are updated; (2) RM-WNBC expresses attribute dependence better than existing weighted naive Bayesian classifiers; and (3) at the given significance level, RM-WNBC achieves better training and testing performance on the 31 datasets than the classic naive Bayesian classifier, three Bayesian networks, four weighted naive Bayesian classifiers, and one feature-selection naive Bayesian classifier.
Keywords: naive Bayes; independence assumption; weighted naive Bayes; structural risk; empirical risk; Bayesian network
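The decision rule behind an attribute-weighted naive Bayesian classifier can be sketched in a few lines. The sketch below is a generic weighted naive Bayes with hand-set weights, not the paper's RM-WNBC: the paper's contribution is learning the weights by jointly minimizing empirical and structural risk, which is not reproduced here. All function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train_wnb(X, y, laplace=1.0):
    """Estimate class priors and per-attribute conditional probabilities
    with Laplace smoothing (categorical attributes)."""
    n = len(y)
    classes = sorted(set(y))
    counts = Counter(y)
    priors = {c: (counts[c] + laplace) / (n + laplace * len(classes)) for c in classes}
    cond = defaultdict(dict)  # cond[(j, c)][v] = P(x_j = v | c)
    for j in range(len(X[0])):
        values = set(row[j] for row in X)
        for c in classes:
            rows = [row for row, label in zip(X, y) if label == c]
            for v in values:
                count = sum(1 for row in rows if row[j] == v)
                cond[(j, c)][v] = (count + laplace) / (len(rows) + laplace * len(values))
    return classes, priors, cond

def predict_wnb(x, classes, priors, cond, weights):
    """Weighted decision rule: argmax_c log P(c) + sum_j w_j * log P(x_j | c).
    Uniform weights w_j = 1 recover the classic naive Bayes classifier."""
    def score(c):
        return math.log(priors[c]) + sum(
            w * math.log(cond[(j, c)].get(v, 1e-9))
            for j, (v, w) in enumerate(zip(x, weights)))
    return max(classes, key=score)
```

With weights fixed at 1.0 this is plain naive Bayes; RM-WNBC would instead update the weights iteratively until its risk objective converges.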
2. A Positive-Unlabeled Learning Algorithm Oriented Toward Labeling Certainty Enhancement
Authors: He Yulin, He Peng, Huang Zhexue, Xie Weicheng, Philippe Fournier-Viger. 《计算机应用》 (Journal of Computer Applications), PKU Core, 2025, Issue 7, pp. 2101-2112 (12 pages).
Positive-Unlabeled Learning (PUL) trains a classifier of practically acceptable performance from a small number of known positive samples and a large number of unlabeled samples when negative samples are unknown. Existing PUL algorithms share a common flaw: high uncertainty in labeling the unlabeled samples, which leads to inaccurate classification boundaries and limits the generalization of the trained classifier on new data. To solve this problem, a Labeling-Certainty-Enhancement-oriented PUL (LCE-PUL) algorithm is proposed. First, reliable positive samples are selected according to the similarity between the mean posterior probability on a validation set and the centroid of the positive sample set, and the labeling process is refined over multiple iterations to improve the accuracy of the preliminary class judgments for unlabeled samples, thereby increasing labeling certainty. Second, these reliable positive samples are merged with the original positive set to form a new positive set and are removed from the unlabeled set. Then, the new unlabeled set is traversed, and the similarity between each sample and several of its nearest neighbors is used to select reliable positive samples again, so as to infer the latent labels of unlabeled samples more accurately, reduce mislabeling, and further improve labeling certainty. Finally, the positive set is updated and the unselected unlabeled samples are treated as negative. The feasibility, rationality, and effectiveness of LCE-PUL are verified on representative datasets. Training converges as the number of iterations increases, and when the positive-sample proportions are 40%, 35%, and 30%, the test accuracy of the classifier built by LCE-PUL improves by up to 5.8, 8.8, and 7.6 percentage points, respectively, over five representative baselines, including the biased support vector machine (BiasedSVM) algorithm based on a specific cost function, the Dijkstra-based PUL label propagation (LP-PUL) algorithm, and the label-propagation-based PUL (PU-LP) algorithm. The results show that LCE-PUL is an effective machine learning algorithm for PUL problems.
Keywords: positive-unlabeled learning; labeling certainty enhancement; posterior probability; Bayesian classifier; two-step approach
3. A Balanced Data Partitioning Method for Spark Based on a Priority-Filling Strategy (Cited: 3)
Authors: He Yulin, Wu Dongtong, Philippe Fournier-Viger, Huang Zhexue. 《电子学报》 (Acta Electronica Sinica), EI, CAS, CSCD, PKU Core, 2024, Issue 10, pp. 3322-3335 (14 pages).
As a distributed big data processing framework based on in-memory computation, Spark is fast and general-purpose. During task computation, Spark's default partitioner, HashPartitioner, tends to produce unbalanced partition sizes on skewed data, resulting in low resource utilization and poor running efficiency. Existing improvements for balanced partitioning in Spark, such as multi-stage partitioning, migration partitioning, and sampling partitioning, mostly suffer from hard-to-control scale, high communication overhead, or over-reliance on sampling. To address these problems, this paper proposes a partitioning method based on a priority-filling strategy that considers the allocation of both sampled and unsampled data, so as to achieve balanced partitioning of all data. After sampling the data and estimating a weight for each key from the sample, the method sorts the keys in descending order of weight and assigns each key to the earliest partition that satisfies a partition tolerance, reserving later partition space for unsampled keys, thereby obtaining a partitioning scheme for the sampled data. Spark partitions the data of sampled keys according to this scheme, while the data of keys not seen in the sample is mapped directly to the last assignable partition. Experimental results show that the new method effectively balances Spark data partitions: on real airline datasets released by the U.S. Bureau of Transportation Statistics, the priority-filling partitioner designed on this method reduces total running time by 15.3% on average compared with HashPartitioner, and by 38.7% and 30.2% on average compared with an existing balanced data partitioner and a hash key-value redistribution partitioner, respectively.
Keywords: balanced partitioning; priority-filling strategy; data skew; Spark operators; big data
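The greedy priority-filling idea described above can be sketched without Spark: sort sampled keys by estimated weight, fill partitions front-to-back under a tolerance cap, and route unseen keys to the last partition. This is a minimal standalone sketch under assumed names (`priority_fill_plan`, `route`); the paper's partitioner additionally reserves partition space for unsampled keys and integrates with Spark's partitioner API.

```python
def priority_fill_plan(key_weights, num_partitions, tolerance=1.05):
    """Greedy priority filling: assign heavy keys first, each to the earliest
    partition whose load stays within tolerance * (total / num_partitions).
    The last partition also serves as the fallback bucket."""
    total = sum(key_weights.values())
    capacity = tolerance * total / num_partitions
    loads = [0.0] * num_partitions
    plan = {}
    for key, w in sorted(key_weights.items(), key=lambda kv: -kv[1]):
        for p in range(num_partitions):
            if loads[p] + w <= capacity or p == num_partitions - 1:
                plan[key] = p
                loads[p] += w
                break
    return plan, loads

def route(key, plan, num_partitions):
    """Keys not seen in the sample map to the last assignable partition."""
    return plan.get(key, num_partitions - 1)
```

For example, with weights {a:5, b:4, c:3, d:2, e:1} and three partitions, the plan yields perfectly balanced loads of 5 per partition, whereas hashing skewed keys could pile them onto one partition.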
4. A Novel QoS-Oriented Spark Job Scheduler
Authors: He Yulin, Mo Peiheng, Philippe Fournier-Viger, Huang Zhexue. 《大数据》 (Big Data Research), 2025, Issue 4, pp. 154-177 (24 pages).
The Spark big data computing framework is widely used to process and analyze explosively growing big data, and the cloud provides on-demand, pay-as-you-go computing resources to satisfy user requests. Many organizations now deploy their big data clusters in the cloud and need to handle Spark job scheduling efficiently to meet users' Quality of Service (QoS) requirements, such as reducing resource cost and shortening job response time. Most existing studies fail to consider multiple user requirements jointly and ignore the characteristics of Spark cluster environments and workloads, leading to wasted resources and unmet QoS requirements. To this end, this paper models the job scheduling problem of cloud-deployed Spark clusters and designs a new Spark job scheduler based on Deep Reinforcement Learning (DRL) to satisfy multiple QoS requirements. A DRL cluster simulation environment is built to train the scheduler's core DRL agent. Training methods based on an absolute deep Q-value network and on proximal policy optimization combined with generalized advantage estimation are implemented in the scheduling environment, enabling the DRL agent to adaptively learn different job types as well as dynamic and bursty cluster characteristics, schedule Spark jobs sensibly, reduce total cluster cost, and shorten average job response time. Tests of the DRL agent on a benchmark suite show that, compared with existing Spark job scheduling solutions, the proposed scheduler is significantly better in total cluster cost, average job response time, and QoS attainment rate, demonstrating its effectiveness.
Keywords: big data computing; quality of service; Spark job scheduler; cloud environment; deep reinforcement learning
5. A Novel Flexible Kernel Density Estimator for Multimodal Probability Density Functions
Authors: Jia-Qi Chen, Yu-Lin He, Ying-Chao Cheng, Philippe Fournier-Viger, Ponnuthurai Nagaratnam Suganthan, Joshua Zhexue Huang. CAAI Transactions on Intelligence Technology, 2025, Issue 6, pp. 1759-1782 (24 pages).
Estimating probability density functions (PDFs) is critical in data analysis, particularly for complex multimodal distributions. Traditional kernel density estimator (KDE) methods often face challenges in accurately capturing multimodal structures due to their uniform weighting scheme, leading to mode loss and degraded estimation accuracy. This paper presents the flexible kernel density estimator (F-KDE), a novel nonparametric approach designed to address these limitations. F-KDE introduces the concept of kernel unit inequivalence, assigning adaptive weights to each kernel unit, which better models local density variations in multimodal data. The method optimises an objective function that integrates estimation error and log-likelihood, using a particle swarm optimisation (PSO) algorithm that automatically determines optimal weights and bandwidths. Through extensive experiments on synthetic and real-world datasets, we demonstrated that (1) the weights and bandwidths in F-KDE stabilise as the optimisation algorithm iterates, (2) F-KDE effectively captures multimodal characteristics, and (3) F-KDE outperforms state-of-the-art density estimation methods regarding accuracy and robustness. The results confirm that F-KDE provides a valuable solution for accurately estimating multimodal PDFs.
Keywords: data analysis; learning (artificial intelligence); machine learning; optimisation; probability
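The core idea of per-kernel weights and bandwidths is easy to write down. The sketch below evaluates a weighted Gaussian mixture of kernels; it assumes the weights and bandwidths are already given, whereas F-KDE's actual contribution is finding them with PSO, which is not reproduced here. The function name `flexible_kde` is an assumption.

```python
import math

def flexible_kde(x, samples, weights, bandwidths):
    """Density estimate with per-kernel weights and bandwidths:
    f(x) = sum_i w_i * N(x; x_i, h_i^2), with sum_i w_i = 1.
    Classic KDE is the special case w_i = 1/n and h_i = h for all i."""
    assert abs(sum(weights) - 1.0) < 1e-9, "kernel weights must sum to 1"
    return sum(
        w / (h * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * ((x - xi) / h) ** 2)
        for xi, w, h in zip(samples, weights, bandwidths))
```

Giving a larger weight or a narrower bandwidth to kernels near a sharp mode is how this family of estimators avoids the mode loss that a single global bandwidth causes.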
6. A Self-Training Semi-Supervised Learning Algorithm with New Class Detection (Cited: 1)
Authors: He Yulin, Chen Jiaqi, Huang Qihang, Philippe Fournier-Viger, Huang Zhexue. 《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology), CSCD, PKU Core, 2023, Issue 9, pp. 2184-2197 (14 pages).
Traditional semi-supervised learning (SSL) algorithms have limited applicability and insufficient generalization, especially when samples of new, unseen classes appear in the training set, which can severely degrade performance. Obtaining labeled samples through manual annotation requires domain experts at high cost in time and money, and because of the limits of expert background knowledge, mislabeling cannot be avoided. SSL algorithms that aim to improve the labeling correctness of samples with unseen labels are therefore in urgent practical demand. After a detailed analysis of the self-training algorithm, an effective New Class Detection Semi-Supervised Learning algorithm (NCD-SSL) is proposed. First, based on the classic extreme learning machine (ELM) model, a general incremental ELM that handles both label-incremental and sample-incremental learning is constructed. Then, the self-training algorithm is improved: samples labeled with high confidence are used for sample-incremental learning, while a buffer pool stores samples labeled with low confidence. Next, new classes are detected with clustering and distribution consistency tests, enabling class-incremental learning. Finally, the feasibility and effectiveness of the proposed algorithm are verified experimentally on synthetic and real datasets. When the number of missing classes is 3, 2, and 1, the test accuracy of the new algorithm is generally about 30, 20, and 10 percentage points higher, respectively, than six other SSL algorithms, confirming that it achieves better new class detection SSL performance.
Keywords: semi-supervised learning (SSL); new class detection; self-training; extreme learning machine; maximum mean discrepancy; distribution consistency
7. A Distributed Observation Point Classifier Based on Random Sample Partition of Big Data
Authors: Li Xu, He Yulin, Cui Laizhong, Huang Zhexue, Philippe Fournier-Viger. 《计算机应用》 (Journal of Computer Applications), CSCD, PKU Core, 2024, Issue 6, pp. 1727-1733 (7 pages).
The Observation Point Classifier (OPC) is a supervised learning model that tries to convert a linearly inseparable problem in a multi-dimensional sample space into a linearly separable problem in a one-dimensional distance space, and it is especially effective for classifying high-dimensional data. To address the high training complexity of OPC on big data classification problems, a distributed OPC (DOPC) based on Random Sample Partition (RSP) of big data is designed on the Spark framework. First, RSP data blocks of the big data are generated in a distributed computing environment and converted into Resilient Distributed Datasets (RDDs). Second, a set of OPCs is trained collaboratively on the RSP data blocks; since each OPC is trained independently on its block, this is efficiently implementable in Spark. Finally, the OPCs trained on the RSP blocks are ensembled into a DOPC that predicts class labels for new samples. The feasibility, rationality, and effectiveness of DOPC are verified experimentally on eight big datasets in a Spark cluster. The results show that DOPC achieves higher test accuracy than a single-machine OPC at lower computational cost, and generalizes better than RSP-based Neural Network (NN), Decision Tree (DT), Naive Bayes (NB), and K-Nearest Neighbor (KNN) classifiers implemented on Spark. These results indicate that DOPC is an efficient, low-cost supervised learning algorithm for big data classification.
Keywords: big data classification; distributed file system; random sample partition; observation point classifier; Spark computing framework
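The space transformation at the heart of the OPC can be illustrated directly: each sample is mapped to its distances from a few fixed observation points, and classification then happens on those distance features. This is only the transformation step under assumed names (`to_distance_space`); the OPC's actual rule for choosing observation points and thresholds in distance space is not reproduced here.

```python
import math

def to_distance_space(X, observation_points):
    """Map each multi-dimensional sample to the vector of its Euclidean
    distances from a set of observation points. Classes that are not linearly
    separable in the original space (e.g., concentric rings) can become
    separable by a single threshold on one distance feature."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return [[dist(x, o) for o in observation_points] for x in X]
```

For instance, points on a radius-1 circle and a radius-3 circle are inseparable by a line, but with the origin as the observation point the distance feature alone separates them at any threshold between 1 and 3.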
8. A Classification-Risk-Based Semi-Supervised Ensemble Learning Algorithm
Authors: He Yulin, Zhu Penghui, Huang Zhexue, Philippe Fournier-Viger. 《模式识别与人工智能》 (Pattern Recognition and Artificial Intelligence), EI, CSCD, PKU Core, 2024, Issue 4, pp. 339-351 (13 pages).
To address the label chaos that current semi-supervised ensemble learning algorithms tend to exhibit when predicting unlabeled samples, a Classification Risk-Based Semi-supervised Ensemble Learning algorithm (CR-SSEL) is proposed. CR-SSEL uses classification risk as the criterion for the confidence of unlabeled samples, which effectively measures the uncertainty of sample labeling. Classifiers are trained iteratively and high-confidence samples are reinforced, so labeling uncertainty gradually decreases and the classification performance of the semi-supervised ensemble improves. The influence of learning parameters, the convergence of the training process, and the improvement in generalization of CR-SSEL are verified on multiple standard datasets. Experiments show that as the number of base classifiers increases, the training process of CR-SSEL converges and better classification accuracy is achieved.
Keywords: semi-supervised ensemble learning; ensemble learning; semi-supervised learning; classification risk; uncertainty; confidence
9. Multi-scale persistent spatiotemporal transformer for long-term urban traffic flow prediction
Authors: Jia-Jun Zhong, Yong Ma, Xin-Zheng Niu, Philippe Fournier-Viger, Bing Wang, Zu-Kuan Wei. Journal of Electronic Science and Technology, EI, CAS, CSCD, 2024, Issue 1, pp. 53-69 (17 pages).
Long-term urban traffic flow prediction is an important task in the field of intelligent transportation, as it can help optimize traffic management and improve travel efficiency. To improve prediction accuracy, a crucial issue is how to model spatiotemporal dependency in urban traffic data. In recent years, many studies have adopted spatiotemporal neural networks to extract key information from traffic data. However, most models ignore the semantic spatial similarity between long-distance areas when mining spatial dependency. They also ignore the impact of predicted time steps on the next unpredicted time step for making long-term predictions. Moreover, these models lack a comprehensive data embedding process to represent complex spatiotemporal dependency. This paper proposes a multi-scale persistent spatiotemporal transformer (MSPSTT) model to perform accurate long-term traffic flow prediction in cities. MSPSTT adopts an encoder-decoder structure and incorporates temporal, periodic, and spatial features to fully embed urban traffic data to address these issues. The model consists of a spatiotemporal encoder and a spatiotemporal decoder, which rely on temporal, geospatial, and semantic space multi-head attention modules to dynamically extract temporal, geospatial, and semantic characteristics. The spatiotemporal decoder combines the context information provided by the encoder, integrates the predicted time step information, and is iteratively updated to learn the correlation between different time steps in the broader time range to improve the model's accuracy for long-term prediction. Experiments on four public transportation datasets demonstrate that MSPSTT outperforms the existing models by up to 9.5% on three common metrics.
Keywords: graph neural network; multi-head attention mechanism; spatiotemporal dependency; traffic flow prediction
10. Identifying Causes Helps a Tutoring System to Better Adapt to Learners during Training Sessions
Authors: Usef Faghihi, Philippe Fournier-Viger, Roger Nkambou, Pierre Poirier. Journal of Intelligent Learning Systems and Applications, 2011, Issue 3, pp. 139-154 (16 pages).
This paper describes a computational model for the implementation of causal learning in cognitive agents. The Conscious Emotional Learning Tutoring System (CELTS) is able to provide dynamic fine-tuned assistance to users. The integration of a causal learning mechanism within CELTS allows CELTS to first establish, through a mix of data-mining algorithms, gross user group models. CELTS then uses these models to find the cause of users' mistakes, evaluate their performance, predict their future behavior, and, through a pedagogical knowledge mechanism, decide which tutoring intervention fits best.
Keywords: cognitive agents; computational causal modeling and learning; emotions
11. Large Deviation Algorithms for Thresholding Bandit Problem
Authors: Manjing Zhang, Guangwu Liu, Shan Dai, Jiaqi Chen, Philippe Fournier-Viger. Big Data Mining and Analytics, 2025, Issue 5, pp. 1189-1209 (21 pages).
The Thresholding Bandit (TB) problem is a popular sequential decision-making problem which aims at identifying the systems whose means are greater than a threshold. Instead of working on the upper bound of a loss function, our approach stands out from conventional practices by directly minimizing the loss itself. Leveraging large deviation theory, we first provide an asymptotically optimal allocation rule for the TB problem, and then propose a parameter-free Large Deviation (LD) algorithm to make the allocation rule implementable. A central limit theorem-based Large Deviation (CLD) algorithm is further proposed as a supplement to improve computation efficiency using a normal approximation. Extensive experiments are conducted to validate the superiority of our algorithms compared to existing methods, and to demonstrate their broader applicability to more general distributions and various kinds of loss functions.
Keywords: Thresholding Bandit (TB) problem; Large Deviation (LD) theory; optimal allocation rule; parameter-free policy; asymptotical optimality
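The TB problem setup can be made concrete with a naive baseline: pull every arm equally, then report the arms whose empirical means exceed the threshold. This is only the problem formulation under assumed names (`thresholding_bandit`), not the paper's LD algorithm, which instead adapts the allocation across arms to minimize the large-deviation rate of misclassifying an arm.

```python
import random

def thresholding_bandit(arms, threshold, budget, seed=0):
    """Uniform-allocation baseline for the TB problem. Each arm is a callable
    taking a random.Random instance and returning one stochastic reward.
    Returns the indices of arms whose empirical mean exceeds the threshold,
    plus the empirical means themselves."""
    rng = random.Random(seed)
    n = len(arms)
    pulls_per_arm = budget // n
    rewards = [[arms[i](rng) for _ in range(pulls_per_arm)] for i in range(n)]
    means = [sum(r) / len(r) for r in rewards]
    return [i for i, m in enumerate(means) if m > threshold], means
```

An adaptive rule would allocate more of the budget to arms whose means lie close to the threshold, since those are the hardest to classify; the uniform rule wastes pulls on easy arms.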
12. CLS-Miner: efficient and effective closed high-utility itemset mining (Cited: 10)
Authors: Thu-Lan Dam, Kenli Li, Philippe Fournier-Viger, Quang-Huy Duong. Frontiers of Computer Science, SCIE, EI, CSCD, 2019, Issue 2, pp. 357-381 (25 pages).
High-utility itemset mining (HUIM) is a popular data mining task with applications in numerous domains. However, traditional HUIM algorithms often produce a very large set of high-utility itemsets (HUIs). As a result, analyzing HUIs can be very time consuming for users. Moreover, a large set of HUIs also makes HUIM algorithms less efficient in terms of execution time and memory consumption. To address this problem, closed high-utility itemsets (CHUIs), concise and lossless representations of all HUIs, were proposed recently. Although mining CHUIs is useful and desirable, it remains a computationally expensive task. This is because current algorithms often generate a huge number of candidate itemsets and are unable to prune the search space effectively. In this paper, we address these issues by proposing a novel algorithm called CLS-Miner. The proposed algorithm utilizes the utility-list structure to directly compute the utilities of itemsets without producing candidates. It also introduces three novel strategies to reduce the search space, namely chain-estimated utility co-occurrence pruning, lower branch pruning, and pruning by coverage. Moreover, an effective method for checking whether an itemset is a subset of another itemset is introduced to further reduce the time required for discovering CHUIs. To evaluate the performance of the proposed algorithm and its novel strategies, extensive experiments have been conducted on six benchmark datasets having various characteristics. Results show that the proposed strategies are highly efficient and effective, that the proposed CLS-Miner algorithm outperforms the current state-of-the-art CHUD and CHUI-Miner algorithms, and that CLS-Miner scales linearly.
Keywords: utility mining; high-utility itemset mining; closed itemset mining; closed high-utility itemset mining
13. A novel overlapping minimization SMOTE algorithm for imbalanced classification (Cited: 1)
Authors: Yulin He, Xuan Lu, Philippe Fournier-Viger, Joshua Zhexue Huang. Frontiers of Information Technology & Electronic Engineering, SCIE, EI, CSCD, 2024, Issue 9, pp. 1266-1281 (16 pages).
The synthetic minority oversampling technique (SMOTE) is a popular algorithm to reduce the impact of class imbalance in building classifiers, and has received several enhancements over the past 20 years. SMOTE and its variants synthesize a number of minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between different classes, which further complicates classifier training. To address this issue, this paper proposes a novel generalization-oriented rather than imputation-oriented minority-class sample point generation algorithm, named overlapping minimization SMOTE (OM-SMOTE). This algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. Then, OM-SMOTE employs a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from overlapping areas between classes. Extensive experiments have been conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. Results show that using OM-SMOTE to generate synthetic minority-class sample points leads to better classifier training performance for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers than the 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is shared publicly on the GitHub platform at https://github.com/luxuan123123/OM-SMOTE/.
Keywords: imbalanced classification; synthetic minority oversampling technique (SMOTE); majority-class sample point; minority-class sample point; generalization capability; overlapping minimization
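The baseline interpolation step that OM-SMOTE builds on can be sketched in a few lines: each synthetic minority point lies on the segment between a minority sample and one of its nearest minority neighbors. The sketch below is classic SMOTE under an assumed name (`smote_points`); OM-SMOTE's remapping of the sample space and its rules for steering points away from class-overlap regions are not reproduced here.

```python
import math
import random

def smote_points(minority, k=2, n_new=4, seed=0):
    """Classic SMOTE synthesis: pick a minority sample, pick one of its k
    nearest minority neighbors, and emit a random convex combination of the
    two. Synthetic points therefore stay on segments between minority
    samples, which is exactly what causes trouble in overlap regions."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p != x),
                           key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + lam * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic
```

Since every synthetic point is a convex combination of two minority samples, it always falls inside the minority set's convex hull; when that hull intrudes into majority territory, overlap-aware variants such as OM-SMOTE become necessary.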
14. Analysis and Classification of Fake News Using Sequential Pattern Mining (Cited: 1)
Authors: M. Zohaib Nawaz, M. Saqib Nawaz, Philippe Fournier-Viger, Yulin He. Big Data Mining and Analytics, EI, CSCD, 2024, Issue 3, pp. 942-963 (22 pages).
Disinformation, often known as fake news, is a major issue that has received a lot of attention lately, and many researchers have proposed effective means of detecting and addressing it. Current machine learning and deep learning based methodologies for the classification/detection of fake news are content-based, network (propagation) based, or multimodal methods that combine both textual and visual information. We introduce here a framework, called FNACSPM, based on sequential pattern mining (SPM), for fake news analysis and classification. In this framework, six publicly available datasets, containing a diverse range of fake and real news, and their combination, are first transformed into a proper format. Then, algorithms for SPM are applied to the transformed datasets to extract frequent patterns (and rules) of words, phrases, or linguistic features. The obtained patterns capture distinctive characteristics associated with fake or real news content, providing valuable insights into the underlying structures and commonalities of misinformation. Subsequently, the discovered frequent patterns are used as features for fake news classification. The framework is evaluated with eight classifiers, and their performance is assessed with various metrics. Extensive experiments were performed, and the obtained results show that FNACSPM outperformed other state-of-the-art approaches for fake news classification, and that it expedites the classification task with high accuracy.
Keywords: disinformation; fake news; sequential pattern mining (SPM); frequent patterns; classification
15. Density estimation-based method to determine sample size for random sample partition of big data
Authors: Yulin He, Jiaqi Chen, Jiaxing Shen, Philippe Fournier-Viger, Joshua Zhexue Huang. Frontiers of Computer Science, SCIE, EI, CSCD, 2024, Issue 5, pp. 57-70 (14 pages).
Random sample partition (RSP) is a newly developed big data representation and management model to deal with big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a challenge for implementing RSP is determining an appropriate sample size for RSP data blocks. While a large sample size increases the burden of big data computation, a small size will lead to insufficient distribution information for RSP data blocks. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated based on the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality by using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks for an increasing sample size. Finally, a series of persuasive experiments are conducted to validate the feasibility, rationality, and effectiveness of DEM. Experimental results show that (1) the iteration function of the FPI method is convergent for calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with the sample size determined by DEM can yield a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than the existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to deal with the sample size determination problem for big data RSP implementation.
Keywords: random sample partition; big data; sample size; Dvoretzky-Kiefer-Wolfowitz inequality; kernel density estimator; probability density function
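The DKW-style sample size calculation can be illustrated in its classic one-dimensional form. The sketch below solves the univariate DKW bound for n directly; the paper works with a multivariate DKW bound that requires fixed-point iteration instead of a closed-form solution, so the function name `dkw_sample_size` and the univariate setting are simplifying assumptions.

```python
import math

def dkw_sample_size(epsilon, delta):
    """Smallest n such that the univariate DKW inequality
        P( sup_x |F_n(x) - F(x)| > epsilon ) <= 2 * exp(-2 * n * epsilon^2)
    bounds the failure probability by delta, i.e.
        n >= ln(2 / delta) / (2 * epsilon^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

For example, to guarantee with 95% confidence that the empirical CDF of an RSP block deviates from the true CDF by at most 0.05 everywhere, 738 samples suffice; note the quadratic blow-up as epsilon shrinks.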