Abstract
WeChat group chats contain a large volume of conversational text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversations are short, interleave multiple topics, and use informal language, traditional keyword extraction methods perform poorly on them. To address this, the paper proposes a multi-stage keyword extraction algorithm based on conversation topic clustering. First, it introduces a conversation topic clustering algorithm that incorporates pre-trained knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SP_TSPK); by jointly considering semantic relevance, message activity, and user intimacy, it addresses topic interleaving and insufficient information. Second, it designs a Multi-Stage Keyword Extraction (MSKE) algorithm that decomposes the task into unsupervised keyword extraction and supervised keyword generation, extracting both keywords that are present in the original text and keywords that are absent from it, while reducing the size of the candidate set and semantic redundancy. Finally, SP_TSPK and MSKE are combined to extract keywords from WeChat conversation text. On the WeChat dataset, compared with AutoKeyGen, F1@5 and F1@O improve by 12.8% and 10.8% on average, respectively, and R@10 reaches on average 2.59 times that of AutoKeyGen. The experimental results show that the proposed algorithm can effectively extract keywords from WeChat conversation text.
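The abstract reports F1@5, F1@O, and R@10 without defining them. The sketch below shows how such scores are commonly computed for keyword extraction. It assumes exact string matching, that precision is taken over the predictions actually returned (some evaluations divide by k instead), and that "O" in F1@O is the number of gold keywords for the document. The function names are hypothetical and the paper's own evaluation script may differ.

```python
def precision_recall_f1(predicted, gold, k):
    """Score the top-k ranked predictions against the gold keyword set.

    Assumption: exact string match, duplicates in `predicted` are ignored.
    """
    top_k = list(dict.fromkeys(predicted))[:k]  # dedupe while keeping rank order
    gold_set = set(gold)
    hits = sum(1 for kw in top_k if kw in gold_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def f1_at_k(predicted, gold, k=5):
    """F1 over the top-k predictions (F1@5 when k=5)."""
    return precision_recall_f1(predicted, gold, k)[2]


def f1_at_o(predicted, gold):
    """F1 with the cutoff set to the number of gold keywords (assumed meaning of @O)."""
    return precision_recall_f1(predicted, gold, len(set(gold)))[2]


def recall_at_k(predicted, gold, k=10):
    """Recall over the top-k predictions (R@10 when k=10)."""
    return precision_recall_f1(predicted, gold, k)[1]


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the paper.
    predicted = ["a", "b", "c", "d", "e", "f"]
    gold = ["a", "c", "x"]
    print(f1_at_k(predicted, gold, k=5))
    print(f1_at_o(predicted, gold))
    print(recall_at_k(predicted, gold, k=10))
```

Dataset-level figures such as those quoted above would typically be the mean of these per-document scores, though the abstract does not state the averaging scheme.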
Authors
王宝会
许卜仁
李长傲
叶子豪
WANG Baohui; XU Boren; LI Chang’ao; YE Zihao (College of Software, Beihang University, Beijing 100191, China; School of Computing, Beihang University, Beijing 100191, China)
Source
《计算机科学》 (Computer Science)
Peking University Core Journal (北大核心)
2025, Issue S1, pp. 239-246 (8 pages)
Keywords
Text clustering
Text generation
Conversation topic clustering
Keyword extraction