Study on Algorithm for Keyword Extraction from WeChat Conversation Text

Abstract: WeChat group chats contain a large volume of conversational text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversation text is short, interleaves topics, and uses informal language, traditional extraction methods perform poorly. To address this, the paper proposes a multi-stage keyword extraction algorithm based on conversation topic clustering. First, it introduces a conversation topic clustering algorithm that incorporates pre-training knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SP_TSPK); by jointly considering semantic relevance, message activity, and user intimacy, the algorithm addresses topic interleaving and insufficient information. Second, it designs a Multi-Stage Keyword Extraction (MSKE) algorithm that decomposes the task into unsupervised keyword extraction and supervised keyword generation, extracting keywords both present in and absent from the original text while reducing the size of the candidate set and semantic redundancy. Finally, SP_TSPK and MSKE are combined to extract keywords from WeChat conversation text. On the WeChat dataset, compared with AutoKeyGen, F1@5 and F1@O improve by 12.8% and 10.8% on average, and R@10 on average reaches 2.59 times that of AutoKeyGen. The experimental results show that the proposed algorithm extracts keywords from WeChat conversation text effectively.
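The abstract reports F1@5, F1@O and R@10 but does not define them. The sketch below shows how such scores are commonly computed in keyphrase evaluation; it is an illustration under stated assumptions, not the paper's evaluation script. It assumes that O is the number of gold keywords for a conversation, that matching is exact after lowercasing, and that precision is divided by the number of predictions actually kept rather than padded to k. The function names and the sample keywords are hypothetical.

```python
from typing import List, Sequence, Tuple


def _normalize(items: Sequence[str]) -> List[str]:
    """Lowercase, strip, and drop repeats while keeping rank order."""
    seen = set()
    unique = []
    for item in items:
        key = item.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def prf_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 over the top-k predictions (exact match)."""
    preds = _normalize(predicted)[:k]
    gold_set = set(_normalize(gold))
    if not preds or not gold_set:
        return 0.0, 0.0, 0.0
    hits = sum(1 for p in preds if p in gold_set)
    if hits == 0:
        return 0.0, 0.0, 0.0
    # Assumption: precision uses the number of predictions kept, not k.
    precision = hits / len(preds)
    recall = hits / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    return prf_at_k(predicted, gold, k)[2]


def f1_at_o(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """F1 with the cutoff set to the number of gold keywords (assumed meaning of O)."""
    return f1_at_k(predicted, gold, len(set(_normalize(gold))))


def recall_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    return prf_at_k(predicted, gold, k)[1]


# Hypothetical ranked predictions and gold keywords for one conversation.
predicted = ["group chat", "keyword extraction", "topic clustering",
             "wechat", "deadline", "text generation"]
gold = ["keyword extraction", "topic clustering", "text generation"]

print(round(f1_at_k(predicted, gold, 5), 3))
print(round(f1_at_o(predicted, gold), 3))
print(round(recall_at_k(predicted, gold, 10), 3))
```

Under these assumptions, F1@O uses a cutoff that adapts to each conversation, while R@10 measures only how many gold keywords appear among the first ten predictions and ignores how many incorrect ones accompany them.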

Authors: WANG Baohui; XU Boren; LI Chang'ao; YE Zihao (College of Software, Beihang University, Beijing 100191, China; School of Computing, Beihang University, Beijing 100191, China)

Affiliations: College of Software, Beihang University; School of Computing, Beihang University

Source: Computer Science (计算机科学), Peking University Core Journal, 2025, Issue S1, pp. 239-246 (8 pages)

Keywords: Text clustering; Text generation; Conversation topic clustering; Keyword extraction

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
