Abstract
To address the loss of accuracy that imbalanced original data causes in multi-label text classification, this paper proposes the BCL-tree model, a method based on a cosine-similarity tree structure that introduces inter-label correlation while preserving the original data distribution. First, based on the sample size of each label, a label with a large sample size is selected as the root node of the tree. Second, cosine similarity is used to measure how strongly each of the remaining small-sample labels correlates with the root node, and the labels are stored in a tree structure according to the degree of correlation. Finally, the label tree embedding is integrated into the BCL model to help the classifier better capture the relationships between labels. During forward propagation, combining the label tree embedding with the BCL output improves classification performance. The BCL-tree model is compared with four baseline models on two different datasets. The experimental results show that its F1 scores on the two datasets are 4.5% and 8.8% higher, respectively, than those of CNN, the best-performing baseline, which validates the method's advantages in classification accuracy and label correlation modeling.
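The first two steps of the abstract (choosing a root label by sample size, then ranking the other labels by cosine similarity to it) can be illustrated with a minimal sketch. The abstract does not say which vectors the similarity is computed on or how the tree is laid out, so this sketch assumes each label is represented by its column in a binary sample-by-label matrix, takes the single most frequent label as the root, and stores the remaining labels as one ordered list of children. The names `build_label_tree`, `label_matrix`, and `label_names` are hypothetical and the toy data is invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_label_tree(label_matrix, label_names):
    """Build a root-plus-ordered-children label tree.

    Assumption (not stated in the abstract): each label is represented by its
    0/1 column in label_matrix, where rows are samples and columns are labels.
    """
    n_labels = len(label_names)
    columns = [[row[j] for row in label_matrix] for j in range(n_labels)]
    counts = [sum(col) for col in columns]

    # Step 1: the label with the most samples becomes the root.
    root = max(range(n_labels), key=lambda j: counts[j])

    # Step 2: score every other label by cosine similarity to the root.
    children = [
        (label_names[j], cosine_similarity(columns[root], columns[j]))
        for j in range(n_labels)
        if j != root
    ]

    # Assumed layout: children ordered from most to least correlated.
    children.sort(key=lambda item: item[1], reverse=True)
    return {"root": label_names[root], "children": children}


# Toy data (invented): 4 samples, 3 labels.
label_matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
label_names = ["news", "sports", "finance"]
tree = build_label_tree(label_matrix, label_names)
print(tree["root"], [name for name, _ in tree["children"]])
```

In this toy example the most frequent label becomes the root and the label that co-occurs with it more often is ranked first. How the paper turns such a tree into an embedding and fuses it with the BCL output is not described in the abstract, so that step is left out here.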
Authors
崔海越
李延玲
CUI Haiyue; LI YanLing (School of Mathematics and Statistics, Qinghai Minzu University, Xining 810000, China)
Source
《高师理科学刊》
2025, No. 8, pp. 18-24 (7 pages)
Journal of Science of Teachers' College and University
Funding
Qinghai Minzu University Graduate Innovation Project (07M2024013)
Natural Science Foundation of Qinghai Province (2023-ZJ-949Q).
Keywords
multi-label text classification
imbalanced sample distribution
BERT model
cosine similarity