Abstract
This paper reviews recent advances in the five main steps of data preprocessing for Web usage mining: data cleaning, user identification, session identification, path completion, and transaction identification. The path completion algorithm that combines site topology with referrer pages, and the transaction identification algorithm based on maximal forward references, each rely on a single identifying feature and place high demands on the training data set, so they remain some distance from practical application. To address this, optimization suggestions are proposed for user identification, session identification, and transaction identification, drawing on the combination of Cookie technology with heuristic rules, a dynamic time threshold method, and multi-feature fusion.
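As an illustration of the maximal forward reference idea named in the abstract, the following is a minimal sketch in Python. It follows the commonly described formulation (a transaction ends when the user moves back to a page already on the current path) and is not necessarily the variant discussed in the paper. The function name `maximal_forward_references` and the single-letter page identifiers are hypothetical.

```python
from typing import Hashable, List, Sequence


def maximal_forward_references(clicks: Sequence[Hashable]) -> List[list]:
    """Split one user's click sequence into maximal forward references.

    Assumption: `clicks` is already cleaned and belongs to a single
    session of a single user, in chronological order.
    """
    path: list = []         # current forward path from the entry page
    transactions: List[list] = []
    extending = False       # True if the previous move was a forward move

    for page in clicks:
        if page in path:
            # Backward reference: the page is already on the current path.
            if extending:
                # The forward run just ended, so record it.
                transactions.append(list(path))
            # Drop everything after the revisited page.
            del path[path.index(page) + 1:]
            extending = False
        else:
            # Forward reference: extend the current path.
            path.append(page)
            extending = True

    # A forward run still open at the end of the sequence is also recorded.
    if extending and path:
        transactions.append(list(path))
    return transactions


if __name__ == "__main__":
    for t in maximal_forward_references(list("ABCDCBEGHGWAOUOV")):
        print("".join(t))
```

Each returned list is one candidate transaction. Because the split depends only on page revisits, it illustrates the single-feature limitation noted in the abstract.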
Source
《郑州轻工业学院学报(自然科学版)》
CAS
2010, No. 4, pp. 71-74 (4 pages)
Journal of Zhengzhou University of Light Industry:Natural Science
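Funding
Scientific research project funded by the Education Department of Hunan Province (08C335)
Key teaching research and reform project of Hunan University of Science and Technology (G30946)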