Abstract
Event clustering for social media aims to group short texts by the events they describe. Existing event clustering models are either unsupervised or supervised: unsupervised models cluster poorly, while supervised models depend on large amounts of labeled data. To address both problems, this paper proposes SemiEC, a semi-supervised incremental event clustering model. Starting from a small annotated dataset, SemiEC represents events with an LSTM, computes text similarity with a linear model, and performs incremental clustering. It then retrains the model on the labeled samples produced by incremental clustering and, once retraining is finished, re-clusters the uncertain samples. Experimental results show that SemiEC considerably outperforms the baseline models.
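The cluster-then-revisit loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. A bag-of-words vector with cosine similarity stands in for the LSTM encoder and the learned linear similarity model, the retraining step is omitted, and the function names and the thresholds `accept`, `reject`, and `threshold` are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in encoder (assumption): bag-of-words counts, not an LSTM."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def best_match(vec, clusters):
    """Return the most similar cluster and its similarity (None, 0.0 if none)."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = cosine(vec, cluster["centroid"])
        if sim > best_sim:
            best, best_sim = cluster, sim
    return best, best_sim


def incremental_cluster(texts, accept=0.5, reject=0.2):
    """Single pass over the stream; ambiguous texts are set aside as uncertain."""
    clusters, uncertain = [], []
    for i, text in enumerate(texts):
        vec = embed(text)
        best, sim = best_match(vec, clusters)
        if best is not None and sim >= accept:
            best["members"].append(i)
            best["centroid"].update(vec)  # summed counts act as the centroid
        elif best is not None and sim > reject:
            uncertain.append(i)  # neither clearly old nor clearly new
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters, uncertain


def recluster_uncertain(texts, clusters, uncertain, threshold=0.25):
    """Second pass: place each uncertain text using the final centroids."""
    for i in uncertain:
        vec = embed(texts[i])
        best, sim = best_match(vec, clusters)
        if best is not None and sim >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


if __name__ == "__main__":
    posts = [
        "earthquake hits coastal city",
        "football final kicks off tonight",
        "earthquake hits coastal town",
        "strong earthquake tonight",
    ]
    found, pending = incremental_cluster(posts)
    recluster_uncertain(posts, found, pending)
    print([c["members"] for c in found])
```

In the model the abstract describes, the confidently clustered samples would be used to retrain the encoder and similarity model before the second pass. Here the second pass only benefits from the more complete centroids.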
Authors
GUO Hengrui (郭恒睿)
WANG Zhongqing (王中卿)
ZHU Qiaoming (朱巧明)
LI Peifeng (李培峰)
School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journal (北大核心)
2022, Issue 2, pp. 152-159 (8 pages)
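Funding
National Natural Science Foundation of China (61772354, 61836007)
Young Scientists Fund of the National Natural Science Foundation of China (61806137)
Priority Academic Program Development of Jiangsu Higher Education Institutions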
Keywords
event clustering on social media
incremental clustering
text similarity