
Study on Algorithm for Keyword Extraction from WeChat Conversation Text
Abstract WeChat group chats contain a large volume of conversational text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversations are short, interleave multiple topics, and use informal language, traditional keyword extraction methods perform poorly. To address this, the paper proposes a multi-stage keyword extraction algorithm based on conversation topic clustering. First, it introduces a conversation topic clustering algorithm that incorporates pre-training knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SP_TSPK), which jointly considers semantic relevance, message activity, and user intimacy to address topic interleaving and insufficient information. Second, it designs a Multi-Stage Keyword Extraction (MSKE) algorithm that decomposes the task into unsupervised keyword extraction and supervised keyword generation, extracting both keywords that are present in the original text and keywords that are absent from it, while reducing the size of the candidate set and semantic redundancy. Finally, SP_TSPK and MSKE are combined to extract keywords from WeChat conversation text. On the WeChat dataset, compared with AutoKeyGen, F1@5 and F1@O improve by 12.8% and 10.8% on average, and R@10 reaches 2.59 times that of AutoKeyGen on average. The experimental results show that the proposed algorithm effectively extracts keywords from WeChat conversation text.
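The abstract does not give the SP_TSPK scoring formula, so the following is only a minimal sketch of generic single-pass clustering over a message stream, under stated assumptions. The function names (`cosine`, `single_pass`), the weights, the threshold, the time window, and the three stand-in signals are all hypothetical: bag-of-words cosine stands in for semantic relevance from a pre-trained model, time decay for message activity, and "user already spoke in this thread" for user intimacy.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(messages, threshold=0.3, w_sem=0.6, w_time=0.2,
                w_user=0.2, window=300.0):
    """Assign each message to the best-scoring existing thread, or open a new one.

    messages: list of (timestamp_seconds, user, text) in chronological order.
    Returns a list of threads, each a list of message indices.
    All weights and the threshold are illustrative assumptions.
    """
    threads = []
    for i, (ts, user, text) in enumerate(messages):
        bow = Counter(text.lower().split())
        best, best_score = None, 0.0
        for thread in threads:
            # Stand-in for semantic relevance (the paper uses pre-training knowledge).
            sem = cosine(bow, thread["bow"])
            # Stand-in for message activity: recently active threads score higher.
            recency = math.exp(-(ts - thread["last_ts"]) / window)
            # Stand-in for user intimacy: the user already took part in the thread.
            familiar = 1.0 if user in thread["users"] else 0.0
            score = w_sem * sem + w_time * recency + w_user * familiar
            if score > best_score:
                best, best_score = thread, score
        if best is not None and best_score >= threshold:
            best["ids"].append(i)
            best["bow"].update(bow)
            best["last_ts"] = ts
            best["users"].add(user)
        else:
            threads.append({"ids": [i], "bow": bow, "last_ts": ts, "users": {user}})
    return [thread["ids"] for thread in threads]


if __name__ == "__main__":
    demo = [
        (0, "alice", "python install error pip"),
        (10, "bob", "lunch pizza today"),
        (20, "carol", "pip install error fix"),
        (30, "dave", "pizza lunch place"),
    ]
    print(single_pass(demo))
```

Each message is visited once and compared only against thread summaries, which is what makes the scheme single-pass; in this sketch two interleaved topics in the demo stream end up in separate threads because the combined score favors the thread with overlapping vocabulary.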
Authors WANG Baohui (王宝会); XU Boren (许卜仁); LI Chang’ao (李长傲); YE Zihao (叶子豪) (College of Software, Beihang University, Beijing 100191, China; School of Computing, Beihang University, Beijing 100191, China)
Source Computer Science (《计算机科学》), Peking University Core Journal, 2025, Issue S1, pp. 239-246 (8 pages)
Keywords Text clustering; Text generation; Conversation topic clustering; Keyword extraction
