Journal Articles
1,517 articles found
1. Out-of-distribution Detection for Power System Text Data by Enhanced Mahalanobis Distance with Calibration
Authors: Yixiang Zhang, Huifang Wang, Yuzhen Zheng, Zhengming Fei, Hui Zhou, Huafeng Luo. Protection and Control of Modern Power Systems, 2026, No. 1, pp. 40-52.
The increasing significance of text data in power system intelligence has highlighted the out-of-distribution (OOD) problem as a critical challenge hindering the deployment of artificial intelligence (AI) models. In a closed-world setting, most AI models cannot detect and reject unexpected data, which exacerbates the harmful impact of the OOD problem. The high similarity between OOD and in-distribution (IND) samples in the power system makes it difficult for existing OOD detection methods to achieve effective results. This study elucidates and addresses the OOD problem in power systems through a text classification task. First, the underlying causes of OOD sample generation are analyzed, highlighting the inherent nature of the OOD problem in the power system. Second, a novel method integrating an enhanced Mahalanobis distance with calibration strategies is introduced to improve OOD detection for text data in power system applications. Finally, a case study using actual text data from power system field operation (PSFO) demonstrates the effectiveness of the proposed method. Experimental results indicate that it outperformed existing methods in text OOD detection tasks within the power system, achieving a remarkable 21.03% improvement in the false positive rate at 95% true positive recall (FPR95) and a 12.97% improvement in classification accuracy in mixed IND-OOD scenarios.
Keywords: out-of-distribution detection; text classification; text data applications in power grid; machine learning; natural language processing
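The Mahalanobis-distance scoring that underlies the abstract above can be sketched as follows. This is a minimal illustration with hypothetical 2-D embeddings and a plain tied-covariance distance; the paper's enhanced distance and calibration strategies are not reproduced here.

```python
# Mahalanobis-distance OOD scoring sketch (hypothetical 2-D text embeddings).
# A test sample far from every class mean, in Mahalanobis terms, is flagged OOD.

def mean(vectors):
    # Component-wise mean of a list of 2-D vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(2)]

def shared_covariance(classes):
    # Pool class-centered samples into one tied 2x2 covariance (c00, c01, c11).
    rows = []
    for vecs in classes.values():
        mu = mean(vecs)
        rows += [[v[0] - mu[0], v[1] - mu[1]] for v in vecs]
    n = len(rows)
    c00 = sum(r[0] * r[0] for r in rows) / n
    c01 = sum(r[0] * r[1] for r in rows) / n
    c11 = sum(r[1] * r[1] for r in rows) / n
    return c00, c01, c11

def mahalanobis_sq(x, mu, cov):
    # Squared Mahalanobis distance using the explicit 2x2 inverse.
    c00, c01, c11 = cov
    det = c00 * c11 - c01 * c01
    i00, i01, i11 = c11 / det, -c01 / det, c00 / det
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return d0 * (i00 * d0 + i01 * d1) + d1 * (i01 * d0 + i11 * d1)

def ood_score(x, classes, cov):
    # Distance to the nearest class mean; a large score suggests OOD.
    return min(mahalanobis_sq(x, mean(v), cov) for v in classes.values())

# Hypothetical IND classes of power-system text embeddings.
classes = {
    "switching": [[0.9, 1.0], [1.1, 0.9], [1.0, 1.1]],
    "grounding": [[-1.0, -0.9], [-0.9, -1.1], [-1.1, -1.0]],
}
cov = shared_covariance(classes)
print(ood_score([1.0, 1.0], classes, cov) < ood_score([5.0, -5.0], classes, cov))
```

An OOD threshold would then be calibrated on held-out IND scores (e.g. to hit a 95% true positive rate), which is the knob FPR95 measures.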
2. An alert-situation text data augmentation method based on MLM
Authors: DING Weijie, MAO Tingyun, CHEN Lili, ZHOU Mingwei, YUAN Ying, HU Wentao. High Technology Letters (EI, CAS), 2024, No. 4, pp. 389-396.
The performance of deep learning models relies heavily on the quality and quantity of training data, and insufficient training data leads to overfitting. In the task of alert-situation text classification, however, it is usually difficult to obtain a large amount of training data. This paper proposes a text data augmentation method based on a masked language model (MLM), aiming to enhance the generalization capability of deep learning models by expanding the training data. The method employs a Mask strategy to randomly conceal words in the text and leverages contextual information to predict and replace the masked words with the MLM, thereby generating new training data. Three Mask strategies are designed, at the character level, word level, and N-gram level, and the performance of each strategy under different Mask ratios is analyzed. Experimental results show that the word-level Mask strategy performs better than traditional data augmentation methods.
Keywords: deep learning; text data augmentation; masked language model (MLM); alert-situation text classification
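The masking half of the pipeline described above can be sketched as follows; the sentence and the 15% ratio are hypothetical, and the subsequent fill-in step, which needs a pretrained MLM such as BERT, is not reproduced.

```python
import random

# Word-level Mask strategy sketch: conceal a random fraction of words,
# producing inputs that an MLM would then fill in to create new samples.

def word_level_mask(text, ratio=0.15, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible example
    words = text.split()
    k = max(1, int(len(words) * ratio))  # mask at least one word
    for i in rng.sample(range(len(words)), k):
        words[i] = "[MASK]"
    return " ".join(words)

masked = word_level_mask("fire alarm reported at the east gate of the plant")
print(masked)
```

A character-level or N-gram strategy differs only in what unit is concealed; the MLM's predictions for each `[MASK]` slot then replace it to yield an augmented sentence.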
3. Quantitative Comparative Study of the Performance of Lossless Compression Methods Based on a Text Data Model
Authors: Namogo Silué, Sié Ouattara, Mouhamadou Dosso, Alain Clément. Open Journal of Applied Sciences, 2024, No. 7, pp. 1944-1962.
Data compression plays a key role in optimizing the use of memory storage space and in reducing latency in data transmission. In this paper, we are interested in lossless compression techniques because their performance is exploited alongside lossy compression techniques for images and videos, generally in a mixed approach. To study the performance of lossless compression methods, we first carried out a literature review, from which we selected the most relevant techniques: arithmetic coding, LZW, Tunstall's algorithm, RLE, BWT, Huffman coding, and Shannon-Fano. Second, we designed a purposive text dataset with a repeating pattern in order to test the behavior and effectiveness of the selected compression techniques. Third, we designed the compression algorithms and developed the programs (scripts) in Matlab to test their performance. Finally, the tests conducted on this deliberately constructed data show very satisfactory results for these methods, in order of performance: LZW, arithmetic coding, Tunstall's algorithm, and BWT + RLE. Likewise, it appears that the performance of certain techniques relative to others is strongly linked, on the one hand, to the sequencing and/or recurrence of the symbols that make up the message and, on the other hand, to the cumulative time of encoding and decoding.
Keywords: arithmetic coding; BWT; compression ratio; comparative study; compression techniques; Shannon-Fano; Huffman; lossless compression; LZW; performance; redundancy; RLE; text data; Tunstall
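Of the methods compared above, RLE is the simplest to illustrate on the kind of repeating-pattern text the study constructs. The sketch below (in Python rather than the paper's Matlab, with a made-up input string) shows why symbol recurrence drives its effectiveness.

```python
# Minimal run-length encoding (RLE) sketch on deliberately repetitive text.

def rle_encode(s):
    # Collapse each run of identical characters into a (char, count) pair.
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append((s[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    # Expand the pairs back into the original string.
    return "".join(ch * n for ch, n in pairs)

data = "AAAAABBBCCCCCCCCAA"  # hypothetical repeating-pattern sample
encoded = rle_encode(data)
assert rle_decode(encoded) == data  # lossless round trip
print(encoded)
```

On text with long runs the pair list is much shorter than the input; on text with no repeats it is longer, which matches the paper's observation that performance depends on symbol sequencing and recurrence.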
4. Clustering Text Data Streams (Cited: 7)
Authors: 刘玉葆, 蔡嘉荣, 印鉴, 傅蔚慈. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2008, No. 1, pp. 112-128.
Clustering text data streams is an important issue in the data mining community and has a number of applications, such as news group filtering, text crawling, document organization, and topic detection and tracing. However, most methods are similarity-based approaches that only use the TF-IDF scheme to represent the semantics of text data, which often leads to poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more effective than the existing TF-IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for a dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for clustering massive text data streams. In both algorithms, we also present a new cluster statistics structure, named the cluster profile, which captures the semantics of text data streams dynamically and at the same time speeds up the clustering process. Efficient implementations of our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
Keywords: clustering; database applications; data mining; text data streams
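The TF-IDF baseline that the paper above argues against can be sketched in a few lines; the toy documents and the smoothed IDF variant are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

# TF-IDF representation sketch: the similarity-based baseline scheme
# that semantic smoothing is claimed to improve on.

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs for w in set(d.split()))
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        # Smoothed IDF so common words are down-weighted, not zeroed.
        vecs.append({w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf})
    return vecs

docs = ["stock market news", "market crash news", "football match result"]
vecs = tf_idf(docs)
print(sorted(vecs[0]))
```

A stream setting would update `df` and the cluster statistics incrementally instead of recomputing over all documents, which is the role the paper's cluster profile structure plays.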
5. Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation (Cited: 4)
Authors: Jiao Li, Si Zheng, Hongyu Kang, Zhen Hou, Qing Qian. Journal of Data and Information Science, 2016, No. 2, pp. 32-44.
Purpose: In the open science era, it is typical to share project-generated scientific data by depositing it in an open and accessible database, while scientific publications are preserved in a digital library archive. It is challenging to identify the data usage mentioned in the literature and associate it with its source. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis.
Design/methodology/approach: We focused on identifying articles using the TCGA dataset and constructing linkages between the articles and the specific TCGA data. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used TCGA data in their studies and summarized its key features. Third, the key features were applied to the remaining PMC full-text articles.
Findings: The number of publications using TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. Additionally, the critical areas of focus in studies using TCGA data were glioblastoma multiforme, lung cancer, and breast cancer, and data from the RNA-sequencing (RNA-seq) platform is the most frequently used.
Research limitations: The current workflow to identify articles that truly used TCGA data is labor-intensive; an automatic method is expected to improve performance.
Practical implications: This study will help cancer genomics researchers track the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery.
Originality/value: Few studies have investigated data usage by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC and linked the full-text articles to the source data.
Keywords: scientific data; full-text literature; open access; PubMed Central; data citation
6. Automatic User Goals Identification Based on Anchor Text and Click-Through Data (Cited: 6)
Authors: YUAN Xiaojie, DOU Zhicheng, ZHANG Lu, LIU Fang. Wuhan University Journal of Natural Sciences (CAS), 2008, No. 4, pp. 495-500.
Understanding the underlying goal behind a user's Web query has been proved helpful for improving the quality of search. This paper focuses on automatically identifying query types according to those goals. Four novel entropy-based features extracted from anchor data and click-through data are proposed, and a support vector machine (SVM) classifier is used to identify the user's goal based on these features. Experimental results show that the proposed entropy-based features are more effective than those reported in previous work. By combining multiple features, the goals of more than 97% of the queries studied can be correctly identified. The paper also reaches the following conclusions: first, anchor-based features are more effective than click-through-based features; second, the number of sites is more reliable than the number of links; third, click-distribution-based features are more effective than session-based ones.
Keywords: query classification; user goals; anchor text; click-through data; information retrieval
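One entropy-based feature of the kind described above can be sketched as the entropy of a query's click distribution over sites; the click counts below are hypothetical, and the paper's exact feature definitions are not reproduced.

```python
import math

# Click-distribution entropy sketch: a navigational query concentrates
# clicks on one site (low entropy); an informational query spreads them
# across sites (high entropy). Such a scalar becomes one SVM feature.

def click_entropy(site_clicks):
    total = sum(site_clicks.values())
    return -sum((c / total) * math.log2(c / total)
                for c in site_clicks.values() if c > 0)

navigational = {"example.com": 98, "other.com": 2}      # hypothetical counts
informational = {"a.com": 30, "b.com": 25, "c.com": 25, "d.com": 20}
print(click_entropy(navigational) < click_entropy(informational))
```

Analogous entropies over anchor-text link targets give the anchor-based features, which the paper finds even more effective.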
7. A Complexity Analysis and Entropy for Different Data Compression Algorithms on Text Files (Cited: 1)
Authors: Mohammad Hjouj Btoush, Ziad E. Dawahdeh. Journal of Computer and Communications, 2018, No. 1, pp. 301-315.
In this paper, we analyze the complexity and entropy of different data compression algorithms: LZW, Huffman, fixed-length code (FLC), and Huffman after using fixed-length code (HFLC). We test these algorithms on files of different sizes and conclude that LZW is the best across all compression scales we tested, especially on large files, followed by Huffman, HFLC, and FLC, respectively. Data compression is still an important research topic with many applications. We therefore suggest continuing research in this field, for example by combining two techniques to reach a better one, or by using another source mapping (Hamming), such as embedding a linear array into a hypercube, together with good techniques like Huffman.
Keywords: text files; data compression; Huffman coding; LZW; Hamming; entropy; complexity
8. A feature representation method for biomedical scientific data based on composite text description
Authors: SUN Wei. Chinese Journal of Library and Information Science, 2009, No. 4, pp. 43-53.
Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is insufficient, which to some extent affects the result of scientific data clustering. The paper therefore proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method uses different feature weighting algorithms to represent candidate features based on two types of data sources, then combines and strengthens the two feature sets. Experiments show that the feature representation method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.
Keywords: composite text description; scientific data; feature representation; weight algorithm
9. Fast Data Processing of a Polarimeter-Interferometer System on J-TEXT
Authors: 刘煜锴, 高丽, 刘海庆, 杨曜, 高翔, J-TEXT Team. Plasma Science and Technology (SCIE, EI, CAS, CSCD), 2016, No. 12, pp. 1143-1147.
A method of fast data processing has been developed to rapidly obtain the evolution of the electron density profile for the multichannel polarimeter-interferometer system (POLARIS) on J-TEXT. Compared with the Abel inversion method, the evolution of the density profile analyzed by this method can quickly offer important information. The method has the advantage of fast calculation, on the order of ten milliseconds per normal shot, and is capable of processing data sampled at up to 1 MHz, which is helpful for studying density sawtooth instability and disruptions between shots. During the flat-top plasma current of usual ohmic discharges on J-TEXT, the shape factor u ranges from 4 to 5. When a disruption happens, the density profile becomes peaked and the shape factor u typically decreases to 1.
Keywords: fast data processing; polarimeter-interferometer; J-TEXT
10. Hazard text classification and type feature analysis based on SVM and a normalized entropy model
Authors: 乔剑锋, 刘萱, 艾莉莎, 张丽玮, 王汀. 《重庆大学学报》 (PKU Core), 2026, No. 2, pp. 105-115.
To improve the efficiency of organizing and retrieving hazard information data and to support more complex information processing tasks, effective techniques are needed to classify the data automatically and analyze its types. A support vector machine (SVM) can classify free text automatically, but the algorithm works by finding an optimal decision boundary in the training set and cannot discover the typical features of each type. To analyze the common features of samples of each type, a normalized entropy model is proposed to find type-typical features, improving on the current term frequency-inverse document frequency (TF-IDF) approach to type feature identification. Taking 2,534 enforcement inspection records from a municipal emergency management bureau as an example, automatic classification with SVM achieves an accuracy of 97%. The normalized entropy model also yields the typical features of each type, providing decision support for formulating targeted hazard screening and remediation strategies. Experimental results show that the combination of SVM and the normalized entropy model can efficiently solve the combined problem of text classification and type feature identification.
Keywords: text mining; data mining; hazard screening; support vector machine
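The normalized-entropy idea in the abstract above can be sketched as follows; the term counts and class setup are hypothetical, and this is only one plausible reading of the model, not the paper's exact formula.

```python
import math

# Normalized-entropy sketch for type-typical features: a term concentrated
# in one hazard class has low normalized entropy across classes and is a
# better "typical feature" than one spread evenly over all classes.

def normalized_entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(counts))  # scale into [0, 1]

# Hypothetical term -> occurrence counts across 3 hazard classes.
term_counts = {
    "fire-exit": [40, 2, 1],   # concentrated: likely type-typical
    "check":     [20, 18, 22], # spread evenly: generic inspection word
}
typical = min(term_counts, key=lambda t: normalized_entropy(term_counts[t]))
print(typical)
```

Ranking terms by ascending normalized entropy surfaces class-specific vocabulary that a decision-boundary classifier like SVM never exposes directly.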
11. A coal mine safety risk prediction model integrating category descriptions and enhanced embeddings
Authors: 杨超宇, 黄大卫. 《安全与环境学报》 (PKU Core), 2026, No. 2, pp. 517-528.
Coal mine safety risk identification texts contain rich descriptions of risk features and expert knowledge, and mining these texts is valuable for risk level prediction. To address the small-sample, short-text, and semantically complex nature of risk identification texts, a coal mine safety risk prediction model integrating category descriptions and enhanced embeddings is proposed. The method augments the data at the sentence-embedding level, effectively expanding the training samples; it introduces coal mining domain knowledge by constructing risk category descriptions and fuses these descriptions dynamically through an attention mechanism, supplementing risk samples with professional knowledge; it uses a Bidirectional Long Short-Term Memory (Bi-LSTM) network together with the Mamba algorithm to deeply extract features from the original text, capturing the core features of complex coal mine text; finally, a dynamic gating mechanism fuses the features of each module to produce the prediction. Experiments show that the model achieves good accuracy and F1 scores on a small-scale coal mine risk identification dataset and can support coal mine safety risk level prediction based on risk identification texts.
Keywords: safety engineering; risk prediction; coal mine safety; small sample; short text; data augmentation; feature fusion
12. Quantitative evaluation of China's provincial public data governance policies based on the PMC index model
Authors: 李春林, 张小亚. 《科技智囊》, 2026, No. 2, pp. 58-67.
[Purpose] To quantitatively evaluate China's provincial public data governance policies and provide a theoretical basis and recommendations for formulating and optimizing them. [Method] 95 provincial public data governance policy texts issued from 2015 to 2024 were selected as the research object; text mining was performed with the ROST CM6 software, and a PMC index model of provincial public data governance policy was constructed for quantitative evaluation. [Findings] The overall quality of China's provincial public data governance policies is high: of the 31 provinces (autonomous regions and municipalities, excluding Hong Kong, Macao, and Taiwan), 6 reach the excellent grade, 23 the good grade, and 2 the acceptable grade. The policies perform well in policy instruments, policy domains, and policy evaluation, but remain weak in policy timeliness and policy effectiveness. To improve policy effectiveness, provincial governments should refine the policy timeliness system, strengthen coordination of policy content, enhance collaboration among policy actors, and activate the role of policy targets.
Keywords: province; public data governance; policy text; PMC index model; policy evaluation
13. Quantitative analysis of local government data security governance policies: a three-dimensional "instrument-goal-process" framework
Authors: 邓崧, 吴宇. 《北京航空航天大学学报(社会科学版)》, 2026, No. 2, pp. 82-94.
Local government data security governance policies are important for promoting the healthy development of data elements and improving local governance capacity and public service levels. Taking 151 government data security governance policies issued by local governments as the research object, a three-dimensional analysis framework of "policy instrument, policy goal, governance process" is constructed, and the policy texts are quantitatively analyzed with bibliometric and content analysis methods. The study shows that local government data security governance policy has developed through phases of fluctuating growth, rapid growth, and slow rise; the structure of policy instruments is unbalanced, with insufficient push and pull forces and an overall pattern of emphasizing environment-side instruments while neglecting supply-side and demand-side ones; policy goals are poorly coordinated and the integration effects of governance are not evident; coverage of the governance process is uneven, with gaps in top-level deployment and management. Going forward, the structure of policy instruments should be balanced and optimized, policy coordination and stability enhanced, controls on sensitive data deepened, data governance jointly improved, the strategic layout of data security refined, and data security work promoted from the top, so as to genuinely raise local governments' data security governance capacity.
Keywords: local government; government data security governance; policy text; policy quantification; content analysis
14. EOS Data Dumper: an automatic download and re-publishing system for free EOS data (Cited: 5)
Authors: 南卓铜, 王亮绪, 李新. 《冰川冻土》 (CSCD, PKU Core), 2007, No. 3, pp. 463-469.
To make better use of existing data resources and avoid duplicated investment in research infrastructure, data sharing has been receiving increasing attention. NASA's Earth Observing System (EOS) provides a large number of free data products, including MODIS. EOS Data Dumper (EDD) programmatically simulates the normal download workflow of the EOS data portal and uses Web page text capture techniques to automatically download, on schedule, all free EOS data for a study area; the data are then re-published to the Internet through the free DIAL system, enabling complex spatio-temporal queries. The paper details the background and significance of the EDD project and its implementation from a technical perspective.
Keywords: EOS data; remote sensing image data; Web text capture; data sharing
15. A text augmentation framework based on open-source large language models
Authors: 霍浩鑫, 管卫利, 方志杰. 《计算机应用研究》 (PKU Core), 2026, No. 3, pp. 842-850.
Traditional data augmentation methods operate only at the data level and struggle to break out of a closed semantic space. To address this, a DeepSeek-based text augmentation framework (DS-Aug) is proposed. The framework uses reasoning-chain prompt templates to generate semantically consistent, domain-specific augmented samples and distilled knowledge, and designs a knowledge-aware transfer fusion model that dynamically injects domain-relevant external knowledge through an attention gating mechanism, improving the model's knowledge transfer and few-shot learning capability. Experiments on two public datasets, BBC news topic classification and MR movie review sentiment analysis, show that DS-Aug outperforms traditional augmentation methods and some pretrained models in both accuracy and F1 score. The results verify the effectiveness of DS-Aug in improving classification performance, with notable robustness in few-shot and cross-domain tasks. The study offers a new approach to applying open-source large language models to domain text classification.
Keywords: text classification; data augmentation; large language model; knowledge transfer
16. Design and implementation of a large-language-model-driven dialogue system for library business data
Authors: 张光照, 王忠义, 王楠, 张银玲, 杨帆. 《图书馆论坛》 (PKU Core), 2026, No. 3, pp. 124-134.
The paper explores fine-tuning large language models so that requirements described in natural language can be turned into SQL queries that retrieve reports from library business databases, addressing the problem that the query functions of existing business systems are fixed and cannot meet the diverse needs of real work. Based on the report analysis and decision-making needs of library departments, a library-domain Text-to-SQL training set and test set were built around the characteristics of library business databases and the Chinese Library Classification. Typical open-source medium-scale (6-7B parameter) large language models from China and abroad were fine-tuned with LoRA on the training set, and the validity of the generated SQL queries was verified on the test set; the model was then deployed with FastChat, a visual interactive system was developed, and the application effects were analyzed in real scenarios. Fine-tuning achieved an execution accuracy of 0.9426 with the Chatglm3-6B model, and the application effects of the fine-tuned medium-scale model were compared with those of the very-large-scale Qwen-Max model in real business interaction scenarios, demonstrating the effectiveness of Text-to-SQL fine-tuning of large language models for natural language querying of library business data.
Keywords: library business data; Text-to-SQL; large language model; database dialogue system; fine-tuning
17. Application of BiGRU-Attention to fault prediction for expressway electromechanical equipment
Authors: 文锦韬, 唐向红, 陆见光, 黎红志. 《哈尔滨理工大学学报》 (PKU Core), 2026, No. 1, pp. 47-58.
Fault texts for expressway electromechanical equipment are short and dense with technical vocabulary, which makes the data hard to understand and its features complex to extract for fault level prediction. A BiGRU-Attention fault prediction model is therefore proposed. First, the model uses the Word2Vec algorithm to train word vectors on the unstructured fault data, building high-quality lexical representations and resolving the comprehension difficulties caused by technical vocabulary. Second, a bidirectional gated recurrent unit captures the temporal information and contextual associations of the text to extract short-text features. An attention mechanism is also introduced so that the model can automatically adjust the weights of key feature values, further improving classification performance. Experimental results show that the model achieves notable results on an expressway electromechanical equipment fault dataset, with an accuracy of 96.8% and a running time of 155 s, providing an efficient and accurate solution for fault level prediction of expressway electromechanical equipment.
Keywords: expressway electromechanical equipment; fault prediction; fault text data; attention mechanism; bidirectional gated recurrent unit
18. Semi-Supervised Learning in Large Scale Text Categorization (Cited: 1)
Authors: 许泽文, 李建强, 刘博, 毕敬, 李蓉, 毛睿. Journal of Shanghai Jiaotong University (Science) (EI), 2017, No. 3, pp. 291-302.
The rapid development of the Internet brings a variety of original information, including text and audio. However, it is difficult to find the most useful knowledge rapidly and accurately because of its sheer volume. Automatic text classification based on machine learning can sort large numbers of natural language documents into the corresponding subject categories according to their semantics, which helps users grasp text information directly. By learning from a set of hand-labeled documents, we obtain a traditional supervised classifier for text categorization (TC). However, labeling all data by hand is labor-intensive and time-consuming. To solve this problem, some scholars proposed a semi-supervised learning method to train the classifier, but it is infeasible for the variety and volume of Web data since it still needs some hand-labeled data. In 2012, Li et al. invented a fully automatic categorization approach for text (FACT) based on supervised learning, in which no manual labeling is required. But automatically labeling all data introduces noise into the experiment, and the result cannot meet the accuracy requirement. We put forward a new idea: part of the data can be automatically tagged with high accuracy based on the semantics of the category name, and a semi-supervised approach then trains the classifier on both labeled and unlabeled data, ultimately achieving a precise classification of massive text data. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in both F1 performance and classification accuracy in most cases, proving the effectiveness of the semi-supervised algorithm in automatic TC.
Keywords: text data mining; semi-supervised; automatic tagging; classifier
19. A Chinese medical short-text classification model fusing multi-level semantics
Authors: 杨杰, 刘纳, 郑国风, 李晨, 道路. 《郑州大学学报(理学版)》 (PKU Core), 2026, No. 1, pp. 51-57.
Medical short-text classification suffers from insufficient extraction of key semantic information and reduced model robustness. A text classification model fusing multi-level semantic information is proposed. First, a pretrained model captures the preliminary semantic features of the text. Second, a capsule network extracts key semantic information, ensuring that the model effectively learns the core semantics of short texts, and attention pooling focuses on document-level information, strengthening the recognition and understanding of medical terminology and concepts. Finally, an adversarial training strategy is introduced to improve stability and accuracy in the face of vague expressions or perturbed inputs. The model is validated on three medical text classification datasets, CHIP-CTC, KUAKE_QIC, and VSQ. Results show that, compared with existing models, the proposed model improves the F1 score on all three datasets, significantly enhancing Chinese medical short-text classification performance.
Keywords: Chinese medical data; short-text classification; semantic fusion; capsule network; attention pooling
20. Copyright open licensing rules for data training and their implementation path
Authors: 李倩, 沈立苏. 《信息安全研究》 (PKU Core), 2026, No. 1, pp. 68-74.
Generative AI training depends on massive numbers of works, raising copyright infringement risks; jurisdictions such as the EU, the US, and Japan regulate this through innovations such as text and data mining exceptions. Although it has become the domestic theoretical consensus that works may appropriately be used for data training, the specific compliance path remains contested. The study finds that a copyright open licensing mechanism should be introduced into data training, replacing work-by-work authorization with self-declarations, encouraging rights holders to participate through reasonable benefit distribution and a transparent regulatory system, and building a dynamic balance between rights protection and technological innovation. Given that works are protected automatically and are vast in number, the public-notice effect of open licensing declarations should be clarified, the reliance interests of bona fide third parties protected, and collective licensing of a rights holder's series of works permitted, so as to better meet the data-intensive use needs of the intelligent era.
Keywords: data training; text and data mining; copyright; open licensing; public-notice effect