Abstract
This paper studies the classification performance of latent semantic indexing (LSI) applied to spam filtering and compares it with the simple vector space model (VSM) and with SpamAssassin, the extremely widespread de facto standard for spam filtering. It also compares two feature sets: purely textual features of e-mail messages, based on standard word- and token-extraction techniques, and application-specific "meta features" of e-mail messages as extracted by SpamAssassin. The results show that VSM and LSI achieve significantly better classification results than SpamAssassin.
Source
《郑州大学学报(理学版)》
CAS
PKU Core (Peking University Core Journals)
2010, Issue 2, pp. 78-82 (5 pages)
Journal of Zhengzhou University: Natural Science Edition
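Funding
Ministry of Education Humanities and Social Sciences Research Planning Project
Grant No. 09YJA630036
Ministry of Education Humanities and Social Sciences Research Youth Fund Project
Grant No. 09YJC740027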