Journal Articles
6 articles found
1. A Study of EM Algorithm as an Imputation Method: A Model-Based Simulation Study with Application to a Synthetic Compositional Data
Authors: Yisa Adeniyi Abolade, Yichuan Zhao. Open Journal of Modelling and Simulation, 2024, Issue 2, pp. 33-42 (10 pages).
Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, i.e., data that sums to a constant such as 100%. The statistical linear model is the most widely used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data is present. The linear regression model is a commonly used statistical modeling technique applied in many settings to find relationships between variables of interest. When estimating linear regression parameters, which are useful for tasks such as future prediction and partial-effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can make data recovery costly and time-consuming. To address this issue, the expectation-maximization (EM) algorithm has been suggested for situations involving missing data. The EM algorithm iteratively finds maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved variables or data. Using the current parameter estimate, the expectation (E) step constructs the expected log-likelihood function; the maximization (M) step then finds the parameters that maximize this expected log-likelihood. This study evaluated how well the EM algorithm performed on a synthetic compositional dataset with missing observations, using both ordinary least squares and robust least squares regression. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-nearest neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
Keywords: compositional data, linear regression model, least squares method, robust least squares method, synthetic data, Aitchison distance, maximum likelihood estimation, expectation-maximization algorithm, k-nearest neighbor, mean imputation
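The E/M loop described in the abstract can be sketched for plain numeric data as iterative regression imputation. This is a minimal toy, not the study's implementation; the function name, initialization, and iteration count are assumptions, and the compositional (closed-sum) aspect is omitted:

```python
import numpy as np

def em_impute(X, n_iter=25):
    """EM-style regression imputation sketch: the M-step refits OLS for
    each column with missing values on the observed rows of the current
    completed data; the E-step replaces each missing entry with its
    expected value (the regression prediction)."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)              # initialize with mean imputation
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = miss[:, j]
            if not rows.any():
                continue
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            # M-step: OLS fit of column j on the other columns (observed rows only)
            beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            # E-step: fill the missing entries with their expected values
            X[rows, j] = A[rows] @ beta
    return X
```

On data with a strong linear relation between columns, this recovers missing entries far more accurately than mean imputation, which is the comparison the study draws.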
2. Application of the MD-KNN Algorithm to Precision Financial Aid in Universities (Cited by: 1)
Authors: 李博, 李霞, 张晓, 王艳秋, 李恒, 张勇, 凌玉龙. 《计算机技术与发展》, 2020, Issue 7, pp. 91-95 (5 pages).
Precision financial aid is currently a hot topic, and many universities in China have explored the precise identification of students in need. To improve the accuracy of precision financial aid for university students, the MD-KNN algorithm (Mahalanobis distance k-nearest neighbor algorithm) is applied to the problem. The collected student data are clustered with the Mahalanobis-distance-based MD-KNN algorithm, and the clustering results are then analyzed iteratively to improve the precision of screening for financially disadvantaged students. Because of the particular characteristics of the student population, student behavior is also linked to poverty status; this paper analyzes that relationship and finds that the number of meals and the number of dining days in the campus cafeteria are positively correlated with the poverty index. Experiments were run on behavioral data of students at a university in Xi'an from November 2017 to April 2018, and the generated list was compared against the list of students certified as financially disadvantaged through the regular offline process. The experiments show that the MD-KNN algorithm has considerable application value for precision financial aid in universities.
Keywords: MD-KNN algorithm, Mahalanobis distance, precision financial aid in universities, clustering algorithm, data mining
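The core of MD-KNN is replacing the usual Euclidean metric with the Mahalanobis distance, which accounts for feature scale and correlation. A minimal sketch of Mahalanobis-based neighbor search (function names are mine; the clustering and iterative-analysis steps of the paper are omitted):

```python
import numpy as np

def mahalanobis(u, v, VI):
    """sqrt((u - v)^T · VI · (u - v)) for a given inverse covariance VI."""
    d = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(d @ VI @ d))

def knn_mahalanobis(X, query, k):
    """Indices of the k rows of X nearest to `query` under the
    Mahalanobis distance estimated from X's own covariance."""
    X = np.asarray(X, dtype=float)
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    dists = [mahalanobis(row, query, VI) for row in X]
    return np.argsort(dists)[:k]
```

With an identity inverse covariance the metric reduces to the Euclidean distance, which is a convenient sanity check.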
3. A Missing-Value Imputation Algorithm Based on Mahalanobis Distance and Grey Analysis (Cited by: 6)
Author: 刘星毅. 《计算机应用》 (Journal of Computer Applications), CSCD, Peking University Core Journal, 2009, Issue 9, pp. 2502-2504, 2536 (4 pages).
To address the sensitivity of the Euclidean distance in the kNN algorithm to density correlation, a new algorithm is proposed that replaces the Euclidean distance with a combination of the Mahalanobis distance and grey analysis, and it is applied to missing-data imputation. The Mahalanobis distance handles datasets with pronounced density correlation, while grey analysis handles cases where density correlation is not pronounced, so the algorithm copes well with any dataset. Experimental results show that it clearly outperforms other existing algorithms in imputation quality.
Keywords: data preprocessing, missing data, nearest-neighbor algorithm, grey analysis, Mahalanobis distance
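"Grey analysis" here refers to grey relational analysis, which scores how similar each candidate record is to a reference record. A minimal sketch of Deng's grey relational grade (the paper's exact formulation and how it is combined with the Mahalanobis distance may differ; `rho` is the customary distinguishing coefficient):

```python
import numpy as np

def grey_relational_grades(reference, candidates, rho=0.5):
    """Deng's grey relational grade of each candidate row against the
    reference vector; values lie in (0, 1], higher means more similar."""
    diffs = np.abs(np.asarray(candidates, dtype=float) -
                   np.asarray(reference, dtype=float))
    dmin, dmax = diffs.min(), diffs.max()
    # per-feature grey relational coefficients, then average over features
    coeffs = (dmin + rho * dmax) / (diffs + rho * dmax)
    return coeffs.mean(axis=1)
```

A candidate identical to the reference gets grade 1, and grades decrease as candidates diverge, so the grades can rank donor records for imputation much as a distance would.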
4. A Missing-Data Imputation Algorithm Based on Mahalanobis Distance (Cited by: 6)
Authors: 刘星毅, 檀大耀, 曾春华, 韦小铃. 《微计算机信息》, 2010, Issue 9, pp. 225-226, 215 (3 pages).
Being simple to apply and effective, the nearest-neighbor algorithm is widely used in both research and everyday practice. This paper first explains the shortcomings of the Euclidean-distance-based nearest-neighbor algorithm in computing the distance between two records, and then proposes a nearest-neighbor algorithm based on the Mahalanobis distance. Experimental results on real datasets show that the improved algorithm achieves better results.
Keywords: nearest-neighbor algorithm, missing-data imputation, Mahalanobis distance
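Putting the pieces together, nearest-neighbor imputation under the Mahalanobis distance can be sketched as below. This is an illustrative sketch, not the paper's implementation: averaging the k donors' values and estimating the covariance from the complete rows restricted to the observed features are my assumptions:

```python
import numpy as np

def knn_impute_mahalanobis(X, k=3):
    """Fill np.nan entries: for each incomplete row, find the k complete
    rows nearest under the Mahalanobis distance over that row's observed
    features, then average the neighbors' values for the missing ones."""
    X = np.asarray(X, dtype=float).copy()
    complete = ~np.isnan(X).any(axis=1)
    Xc = X[complete]
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        # covariance of the observed features, estimated from complete rows
        cov_o = np.atleast_2d(np.cov(Xc[:, obs], rowvar=False))
        VI = np.linalg.inv(cov_o)
        d = Xc[:, obs] - X[i, obs]
        dists = np.einsum('ij,jk,ik->i', d, VI, d)   # squared Mahalanobis
        nn = np.argsort(dists)[:k]
        X[i, ~obs] = Xc[nn][:, ~obs].mean(axis=0)
    return X
```

On a dataset where one feature is a linear function of another, the donors found for a partially observed record yield an imputed value close to the true one.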
5. An Adaptive Weighted Oversampling Algorithm Based on Density Peak Clustering (Cited by: 2)
Authors: 穆伟蒙, 宋燕, 窦军. 《智能计算机与应用》, 2022, Issue 6, pp. 46-53 (8 pages).
Imbalanced data is a challenging problem in supervised learning. Traditional classifiers are usually biased toward the majority class and neglect the minority class, even though minority-class samples often carry important information and deserve more attention. To address this problem, an oversampling technique based on density peak clustering (An Oversampling Technique based on Density Peak Clustering, DPCOTE) is proposed. Its main ideas are: (1) use the k-nearest-neighbor algorithm to remove noisy samples from both the majority and minority classes; (2) assign an oversampling weight to each minority-class sample based on two key factors of the density peaks clustering (DPC) algorithm, namely a sample's local density and its distance to the nearest neighbor of higher local density; (3) measure the distances involved in DPC with the Mahalanobis distance, to eliminate the inconsistency of feature scales. Finally, comparative experiments on 12 UCI datasets, evaluated with several metrics, show that the proposed algorithm outperforms other oversampling methods on imbalanced classification problems.
Keywords: imbalanced data, k-nearest-neighbor algorithm, density peak clustering, Mahalanobis distance
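The two DPC factors named in step (2) can be sketched as follows. For brevity this sketch uses Euclidean pairwise distances where the paper substitutes the Mahalanobis distance, and the Gaussian-kernel density with bandwidth `dc` follows Rodriguez and Laio's formulation rather than necessarily the paper's:

```python
import numpy as np

def dpc_factors(X, dc=1.0):
    """The two density-peaks factors: Gaussian-kernel local density
    rho_i, and delta_i = distance to the nearest point of strictly
    higher density (for the globally densest point, its largest
    distance to any other point)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0             # subtract the self-term
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta
```

In DPCOTE these two factors are combined into a per-sample oversampling weight, so minority samples in sparse or boundary regions can be emphasized.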
6. An up-to-date comparative analysis of the KNN classifier distance metrics for text categorization
Author: Onder Coban. Data Science and Informetrics, 2023, Issue 2, pp. 67-78 (12 pages).
Text categorization (TC) is one of the widely studied branches of text mining and has many applications in different domains. It tries to automatically assign a text document to one of the predefined categories, often by using machine learning (ML) techniques. Choosing the best classifier is the most important step in this task, in which k-Nearest Neighbor (KNN) is widely employed alongside several other well-known classifiers such as Support Vector Machine, Multinomial Naive Bayes, and Logistic Regression. KNN has been extensively used for TC tasks and is one of the oldest and simplest methods for pattern classification. Its performance crucially relies on the distance metric used to identify the nearest neighbors, such that the most frequent label among these neighbors is used to classify an unseen test instance. Hence, in this paper, a comparative analysis of the KNN classifier is performed on a subset (i.e., R8) of the Reuters-21578 benchmark dataset for TC. Experimental results are obtained by using different distance metrics, as well as recently proposed distance-learning metrics, under different cases where the feature model and term-weighting scheme vary. Our comparative evaluation of the results shows that Bray-Curtis and Linear Discriminant Analysis (LDA) are often superior to the other metrics and work well with raw term-frequency weights.
Keywords: text categorization, k-nearest neighbor, distance metric, distance learning algorithms
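The winning Bray-Curtis metric is simple to state. A minimal sketch of KNN classification with it over raw term-frequency vectors (toy data and function names of my own, not the R8 setup):

```python
import numpy as np

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two nonnegative term-frequency
    vectors: sum_i |u_i - v_i| / sum_i (u_i + v_i); 0 = identical."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()

def knn_predict(train_X, train_y, query, k=3, metric=bray_curtis):
    """Plain KNN: the majority label among the k training vectors
    nearest to the query under the given metric."""
    dists = [metric(row, query) for row in train_X]
    nn = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_y)[nn], return_counts=True)
    return labels[np.argmax(counts)]
```

Because the metric is passed as a parameter, swapping in another distance for comparison, as the paper does across its experimental cases, is a one-argument change.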