Abstract: In the rapidly evolving landscape of natural language processing (NLP) and sentiment analysis, improving the accuracy and efficiency of sentiment classification models is crucial. This paper investigates the performance of two advanced models, the large language model (LLM) LLaMA and the NLP model BERT, in the context of airline review sentiment analysis. Through fine-tuning, domain adaptation, and the application of few-shot learning, the study addresses the subtleties of sentiment expressions in airline-related text data. Employing predictive modeling and comparative analysis, the research evaluates the effectiveness of Large Language Model Meta AI (LLaMA) and Bidirectional Encoder Representations from Transformers (BERT) in capturing sentiment intricacies. Fine-tuning, including domain adaptation, enhances the models' performance in sentiment classification tasks. Additionally, the study explores the potential of few-shot learning to improve model generalization using minimal annotated data for targeted sentiment analysis. By conducting experiments on a diverse airline review dataset, the research quantifies the impact of fine-tuning, domain adaptation, and few-shot learning on model performance, providing insights for industries aiming to predict recommendations and enhance customer satisfaction through a deeper understanding of sentiment in user-generated content (UGC). This research contributes to refining sentiment analysis models, ultimately fostering improved customer satisfaction in the airline industry.
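Abstract: In Chinese electronic medical record named entity recognition (CNER), Chinese text lacks delimiters that mark word boundaries, and some existing methods struggle to capture long-range dependency features. This paper therefore proposes a named entity recognition method for CNER based on a pre-trained model, BERT-Transformer-CRF (BTC). First, BERT (Bidirectional Encoder Representations from Transformers) is used to extract text features. Second, a Transformer captures the dependencies between characters, a process that does not depend on the distance between characters. In addition, because the term-dictionary information and radical information of Chinese characters carry deeper semantic information, term-dictionary and radical features are incorporated into the model to improve its performance. Finally, a CRF decodes the predicted labels. Experimental results show that the proposed model achieves F1 scores of 96.22% and 84.65% on the CCKS2017 and CCKS2021 datasets, respectively, outperforming current mainstream named entity recognition models.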
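The final CRF decoding step mentioned above is commonly carried out with Viterbi search over per-character emission scores and tag-transition scores. The following is a minimal pure-Python sketch under assumed toy values: the three-tag set, the score matrices, and the name `viterbi_decode` are illustrative and are not taken from the paper.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from an encoder)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    # best[j] = best score of any path that ends in tag j at the current position
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(num_tags):
            # Pick the previous tag that maximises the path score into tag j.
            prev = max(range(num_tags),
                       key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    tag = max(range(num_tags), key=lambda j: best[j])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path


# Hypothetical tag set: 0 = O, 1 = B-ENT, 2 = I-ENT
TRANSITIONS = [
    [0.0, 0.0, -10.0],  # O -> I-ENT is heavily penalised
    [0.0, -1.0, 1.0],
    [0.0, -1.0, 1.0],
]
EMISSIONS = [
    [2.0, 0.5, 0.0],
    [0.2, 1.5, 1.4],
    [0.1, 0.3, 1.2],
]
print(viterbi_decode(EMISSIONS, TRANSITIONS))  # -> [0, 1, 2]
```

The transition matrix is what lets the decoder rule out invalid label sequences (here, an inside tag directly after an outside tag), which per-character classification alone cannot do.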
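Abstract: Existing Chinese named entity recognition algorithms do not fully consider the data characteristics of the entity recognition task, and they suffer from class imbalance in Chinese sample data, excessive noise in the training data, and large differences in the distribution of the data the model generates on each run. To address these problems, an improved Chinese named entity recognition model is proposed that takes BERT-BiLSTM-CRF (Bidirectional Encoder Representations from Transformers-Bidirectional Long Short-Term Memory-Conditional Random Field) as its baseline. First, P-Tuning v2 is combined with the BERT-BiLSTM-CRF model to extract data features precisely. Then three loss functions, namely focal loss, label smoothing, and KL loss (Kullback-Leibler divergence loss), take part in the loss computation as regularization terms. Experimental results show that the improved model achieves F1 scores of 71.13%, 96.31%, and 95.90% on the Weibo, Resume, and MSRA (Microsoft Research Asia) datasets, respectively, verifying that the proposed algorithm performs better. In addition, across different downstream tasks, the proposed algorithm is easy to combine with and extend to other neural networks.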
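The three loss terms named above have standard textbook definitions, which can be sketched for a single example as follows. How the paper weights and combines them is not stated in the abstract, so `combined_loss`, its `weights` parameter, and the choice to apply the KL term between two predicted distributions for the same input are assumptions made only for illustration.

```python
import math
from typing import List, Sequence


def focal_loss(probs: List[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def label_smoothing_loss(probs: List[float], target: int,
                         epsilon: float = 0.1) -> float:
    """Cross-entropy against a target smoothed towards the uniform distribution."""
    k = len(probs)
    loss = 0.0
    for j, p in enumerate(probs):
        q = (1.0 - epsilon) * (1.0 if j == target else 0.0) + epsilon / k
        loss -= q * math.log(p)
    return loss


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) = sum_j p_j * log(p_j / q_j); terms with p_j = 0 contribute 0."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0.0)


def combined_loss(probs_a: List[float], probs_b: List[float], target: int,
                  weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Hypothetical weighted sum of the three terms (weights are assumptions).

    probs_a and probs_b stand for two predicted distributions for the same
    input, e.g. from two stochastic forward passes.
    """
    w_focal, w_smooth, w_kl = weights
    return (w_focal * focal_loss(probs_a, target)
            + w_smooth * label_smoothing_loss(probs_a, target)
            + w_kl * kl_divergence(probs_a, probs_b))


if __name__ == "__main__":
    a = [0.7, 0.2, 0.1]
    b = [0.6, 0.3, 0.1]
    print(combined_loss(a, b, target=0))
```

With `gamma = 0` the focal term reduces to ordinary cross-entropy, and larger values of `gamma` shrink the contribution of examples the model already classifies confidently, which is why focal loss is often used against class imbalance.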
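Abstract: In university C programming courses, using objectively rated question difficulty to assess students' learning is an important tool. Most current difficulty estimation methods target specific subjects and specific question types, and they fall short for Chinese-language programming questions. Therefore, FTKB-BiLSTM (Fusion of Title and Knowledge based on BERT and Bi-LSTM) is proposed, a difficulty prediction model for C programming questions that fuses question text with knowledge-point labels and is built on BERT (Bidirectional Encoder Representations from Transformers) and a bidirectional long short-term memory (Bi-LSTM) model. First, a Chinese pre-trained BERT model is used to obtain word vectors for the question text and the knowledge points. Second, the fusion module passes the fused information through BERT to obtain a representation of the text, which is fed into the Bi-LSTM model to learn its sequential information and extract richer features. Finally, the feature representation produced by the Bi-LSTM is passed through a fully connected layer and a Softmax function to obtain the difficulty classification. Experiments on the Leetcode Chinese dataset and the ZjgsuOJ platform dataset show that the proposed model achieves higher accuracy than mainstream deep learning models such as XLNet and has strong classification ability.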
Abstract: Short Message Service (SMS) is a widely used and cost-effective communication medium that has unfortunately become a frequent target for unsolicited messages, commonly known as SMS spam. With the rapid adoption of smartphones and increased Internet connectivity, SMS spam has emerged as a prevalent threat. Spammers have recognized the critical role SMS plays in modern communication, making it a prime target for abuse. As cybersecurity threats continue to evolve, the volume of SMS spam has increased substantially in recent years. Moreover, the unstructured format of SMS data creates significant challenges for SMS spam detection, making it more difficult to combat spam attacks. In this paper, we present an optimized and fine-tuned transformer-based language model to address the problem of SMS spam detection. We use a benchmark SMS spam dataset to analyze this spam detection model. Additionally, we apply pre-processing techniques to obtain clean and noise-free data and address the class imbalance problem with text augmentation techniques. The experiments showed that our optimized, fine-tuned RoBERTa model, a variant of BERT (Bidirectional Encoder Representations from Transformers), achieved an accuracy of 99.84%. To further enhance model transparency, we incorporate Explainable Artificial Intelligence (XAI) techniques that compute positive and negative coefficient scores, offering insight into the model's decision-making process. Additionally, we evaluate the performance of traditional machine learning models as a baseline for comparison. This analysis demonstrates the significant impact language models can have on addressing complex text-based challenges within the cybersecurity landscape.
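Abstract: To address the low efficiency and heavy reliance on manual work in the governance of customer request data at the State Grid customer service center, a multi-stage joint data governance framework is proposed that fuses Bidirectional Encoder Representations from Transformers (BERT) with Bidirectional Long Short-Term Memory (BiLSTM). Core modules for validity judgment, semantic enhancement, request monitoring, and business scenario classification are built to form an end-to-end governance pipeline covering data preprocessing, semantic analysis, classification and prediction, and request application. Experimental results show that the proposed BERT-BiLSTM fusion achieves good performance metrics. Through the coordinated mechanism of dynamic semantic feature extraction and context modeling, the framework achieves fine-grained classification of customer requests and identification of risk points, verifies the applicability and effectiveness of the BERT-BiLSTM fusion model for processing and applying text data in power enterprises, and offers a richer set of solutions for building automated data governance systems.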
Funding: Funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2504).
Abstract: Dialectal Arabic text classification (DA-TC) provides a mechanism for performing sentiment analysis on recent Arabic social media, a task that poses many challenges owing to the natural morphology of the Arabic language and its wide range of dialect variations. The availability of annotated datasets is limited, and preprocessing of the noisy content is even more challenging, sometimes resulting in the removal of important sentiment cues from the input. To overcome such problems, this study investigates the applicability of transfer learning based on pre-trained transformer models to classify sentiment in Arabic texts with high accuracy. Specifically, it uses the CAMeLBERT model fine-tuned on the Multi-Domain Arabic Resources for Sentiment Analysis (MARSA) dataset, which contains more than 56,000 manually annotated tweets across political, social, sports, and technology domains. The proposed method avoids extensive preprocessing and shows that raw data provides better results because it tends to retain more linguistic features. The fine-tuned CAMeLBERT model produces state-of-the-art accuracy of 92%, precision of 91.7%, recall of 92.3%, and F1-score of 91.5%, outperforming standard machine learning models and ensemble-based/deep learning techniques. Our performance comparisons against other pre-trained models, namely AraBERTv02-twitter and MARBERT, show that transformer-based architectures are consistently the best suited when dealing with noisy Arabic texts. This work offers a strong remedy for the problems in Arabic sentiment analysis and provides recommendations on easy tuning of the pre-trained models to adapt to challenging linguistic features and domain-specific tasks.
Funding: Funded by the Scientific Research Deanship at the University of Hail, Saudi Arabia, through project number RG-23092.
Abstract: Cyberbullying on social media poses significant psychological risks, yet most detection systems over-simplify the task by focusing on binary classification, ignoring nuanced categories like passive-aggressive remarks or indirect slurs. To address this gap, we propose a hybrid framework combining Term Frequency-Inverse Document Frequency (TF-IDF), word-to-vector (Word2Vec), and Bidirectional Encoder Representations from Transformers (BERT) based models for multi-class cyberbullying detection. Our approach integrates TF-IDF for lexical specificity and Word2Vec for semantic relationships, fused with BERT's contextual embeddings to capture syntactic and semantic complexities. We evaluate the framework on a publicly available dataset of 47,000 annotated social media posts across five cyberbullying categories: age, ethnicity, gender, religion, and indirect aggression. Among the BERT variants tested, BERT Base Uncased achieved the highest performance with 93% accuracy (standard deviation of ±1% across 5-fold cross-validation) and an average AUC of 0.96, outperforming standalone TF-IDF (78%) and Word2Vec (82%) models. Notably, it achieved near-perfect AUC scores (0.99) for age and ethnicity-based bullying. A comparative analysis with state-of-the-art benchmarks, including Generative Pre-trained Transformer 2 (GPT-2) and Text-to-Text Transfer Transformer (T5) models, highlights BERT's superiority in handling ambiguous language. This work advances cyberbullying detection by demonstrating how hybrid feature extraction and transformer models improve multi-class classification, offering a scalable solution for moderating nuanced harmful content.
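To make the hybrid feature idea concrete, the sketch below computes one common TF-IDF variant (term frequency times log(N / document frequency)) and joins it with a dense vector by concatenation. The abstract does not say how the features are fused, so concatenation, the function names `tfidf_vectors` and `fuse`, and the toy posts and dense values are all assumptions for illustration only.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return (vocabulary, one TF-IDF vector per tokenised document)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


def fuse(*parts: List[float]) -> List[float]:
    """Fuse feature vectors by simple concatenation (an assumed strategy)."""
    fused: List[float] = []
    for part in parts:
        fused.extend(part)
    return fused


if __name__ == "__main__":
    posts = [["you", "are", "old"], ["you", "are", "kind"]]
    vocab, sparse = tfidf_vectors(posts)
    # Placeholder standing in for a Word2Vec or BERT embedding of the first post.
    dense = [0.1, 0.2]
    print(vocab)
    print(fuse(sparse[0], dense))
```

Tokens that occur in every post ("you", "are") get zero weight, while tokens specific to one post keep a positive weight, which is the lexical specificity that the dense embeddings do not encode directly.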
Funding: Supported by the Deanship of Graduate Studies and Scientific Research at Najran University, Saudi Arabia, through the Easy Track Research program, grant code (NU/EFP/MRC/13).
Abstract: Background: Accurate classification of normal blood cells is a critical foundation for automated hematological analysis, including the detection of pathological conditions like leukemia. While convolutional neural networks (CNNs) excel in local feature extraction, their ability to capture global contextual relationships in complex cellular morphologies is limited. This study introduces a hybrid CNN-Transformer framework to enhance normal blood cell classification, laying the groundwork for future leukemia diagnostics. Methods: The proposed architecture integrates pre-trained CNNs (ResNet50, EfficientNetB3, InceptionV3, CustomCNN) with Vision Transformer (ViT) layers to combine local and global feature modeling. Four hybrid models were evaluated on the publicly available Blood Cell Images dataset from Kaggle, comprising 17,092 annotated normal blood cell images across eight classes. The models were trained using transfer learning, fine-tuning, and computational optimizations, including cross-model parameter sharing to reduce redundancy by reusing weights across CNN backbones and attention-guided layer pruning to eliminate low-contribution layers based on attention scores, improving efficiency without sacrificing accuracy. Results: The InceptionV3-ViT model achieved a weighted accuracy of 97.66% (accounting for class imbalance by weighting each class's contribution), a macro F1-score of 0.98, and a ROC-AUC of 0.998. The framework excelled in distinguishing morphologically similar cell types, demonstrating robustness and reliable calibration (ECE of 0.019). The framework addresses generalization challenges, including class imbalance and morphological similarities, ensuring robust performance across diverse cell types. Conclusion: The hybrid CNN-Transformer framework significantly improves normal blood cell classification by capturing multi-scale features and long-range dependencies. Its high accuracy, efficiency, and generalization position it as a strong baseline for automated hematological analysis, with potential for extension to leukemia subtype classification through future validation on pathological samples.
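The calibration figure quoted above, expected calibration error (ECE), is usually computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that standard definition; the bin count, the half-open binning convention, and the function name are assumptions, since the abstract does not give the exact evaluation settings.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               num_bins: int = 10) -> float:
    """ECE = sum over bins of (bin_size / n) * |accuracy(bin) - confidence(bin)|.

    confidences[i] : predicted probability of the chosen class, in [0, 1]
    correct[i]     : whether prediction i was right
    Bins are equal-width; a confidence of exactly 1.0 falls in the last bin.
    """
    n = len(confidences)
    if n == 0:
        return 0.0
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * num_bins), num_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if ok else 0.0
    # |hits - confidence mass| / n equals (bin_size / n) * |acc - mean conf|.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n


if __name__ == "__main__":
    # Ten predictions at 90% confidence, nine of them correct: well calibrated.
    confs = [0.9] * 10
    hits = [True] * 9 + [False]
    print(expected_calibration_error(confs, hits))
```

A value near zero means that predicted confidence matches observed accuracy; a model that is always fully confident and always wrong would score 1.0.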
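Abstract: Relation extraction is an important part of information extraction and aims to extract the relations between entities from unstructured text. Deep-learning-based entity relation extraction has achieved some results, but its feature extraction is not comprehensive enough, and there is still considerable room for improvement on experimental metrics. Unlike tasks such as natural language classification and entity recognition, entity relation extraction relies mainly on information about the sentence and the two target entities. Based on these characteristics, this paper proposes the SEF-BERT relation extraction model (Fusion Sentence-Entity Features and Bert Model). The model builds on a pre-trained BERT model: after the text is processed by the pre-trained BERT model, sentence features and entity features are further extracted. The sentence features and entity features are then fused so that the fused feature vector carries the features of both the sentence and the two entities, strengthening the model's ability to process feature vectors. Finally, the model is trained and tested on a general-domain dataset and a medical-domain dataset. Experimental results show that SEF-BERT outperforms existing models on both datasets.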
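One simple way to build a vector that carries both sentence and entity information is to take a sentence-level vector, mean-pool the token vectors inside each entity span, and concatenate the three. The abstract does not describe SEF-BERT's actual fusion operation, so the sketch below is only an assumed illustration: the use of position 0 as the sentence vector, mean pooling, concatenation, and all names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def fuse_sentence_entities(token_vectors: List[Vector],
                           e1_span: Tuple[int, int],
                           e2_span: Tuple[int, int]) -> Vector:
    """Concatenate a sentence vector with two pooled entity vectors.

    token_vectors : one vector per token, e.g. encoder outputs
    e1_span, e2_span : half-open (start, end) token ranges of the two entities
    Assumes position 0 holds a [CLS]-style sentence representation.
    """
    sentence = token_vectors[0]
    entity1 = mean_pool(token_vectors[e1_span[0]:e1_span[1]])
    entity2 = mean_pool(token_vectors[e2_span[0]:e2_span[1]])
    return sentence + entity1 + entity2


if __name__ == "__main__":
    # Toy 2-dimensional token vectors; real encoder outputs are much larger.
    tokens = [[1.0, 0.0], [2.0, 2.0], [4.0, 4.0], [0.0, 6.0]]
    print(fuse_sentence_entities(tokens, (1, 3), (3, 4)))
    # -> [1.0, 0.0, 3.0, 3.0, 0.0, 6.0]
```

A classifier placed on top of the fused vector then sees the sentence context and both target entities at once, which matches the abstract's point that relation extraction depends on all three.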