Journal Articles
1,470 articles found
1. An alert-situation text data augmentation method based on MLM
Authors: DING Weijie, MAO Tingyun, CHEN Lili, ZHOU Mingwei, YUAN Ying, HU Wentao. High Technology Letters (EI, CAS), 2024, No. 4, pp. 389-396.
The performance of deep learning models is heavily reliant on the quality and quantity of training data, and insufficient training data leads to overfitting. In the task of alert-situation text classification, however, it is usually difficult to obtain a large amount of training data. This paper proposes a text data augmentation method based on a masked language model (MLM), aiming to enhance the generalization capability of deep learning models by expanding the training data. The method employs a Mask strategy to randomly conceal words in the text, leveraging contextual information to predict and replace the masked words with the MLM, thereby generating new training data. Three Mask strategies (character-level, word-level, and N-gram) are designed, and the performance of each strategy under different Mask ratios is analyzed. Experimental results show that the word-level Mask strategy outperforms traditional data augmentation methods.
Keywords: deep learning; text data augmentation; masked language model (MLM); alert-situation text classification
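As a minimal illustration of the three Mask strategies the abstract describes (a sketch over whitespace tokens; the function name, parameters, and `[MASK]` token are assumptions, not the authors' code):

```python
import random

def mask_tokens(tokens, strategy="word", ratio=0.15, n=2, mask="[MASK]"):
    """Randomly conceal parts of a token list, mimicking the paper's
    character-level, word-level and N-gram Mask strategies (illustrative)."""
    out = list(tokens)
    if strategy == "word":
        for i in range(len(out)):
            if random.random() < ratio:
                out[i] = mask
    elif strategy == "char":
        for i, tok in enumerate(out):
            out[i] = "".join(mask if random.random() < ratio else c for c in tok)
    elif strategy == "ngram":
        i = 0
        while i < len(out):
            if random.random() < ratio:  # mask a span of n consecutive tokens
                for j in range(i, min(i + n, len(out))):
                    out[j] = mask
                i += n
            else:
                i += 1
    return out

random.seed(0)
print(mask_tokens("the suspect fled the scene on foot".split(), "word", 0.3))
```

A real MLM (e.g. a BERT-style model) would then predict a replacement for each `[MASK]` from context, producing the new training sample.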
2. Quantitative Comparative Study of the Performance of Lossless Compression Methods Based on a Text Data Model
Authors: Namogo Silué, Sié Ouattara, Mouhamadou Dosso, Alain Clément. Open Journal of Applied Sciences, 2024, No. 7, pp. 1944-1962.
Data compression plays a key role in optimizing memory storage space and reducing latency in data transmission. This paper focuses on lossless compression techniques, whose performance is exploited by lossy image and video compression in mixed approaches. To study the performance of lossless compression methods, we first carried out a literature review, from which we selected the most relevant methods: arithmetic coding, LZW, Tunstall's algorithm, RLE, BWT, Huffman coding, and Shannon-Fano. Second, we designed a purposive text dataset with a repeating pattern to test the behavior and effectiveness of the selected techniques. Third, we implemented the compression algorithms as Matlab scripts and tested their performance. On this deliberately patterned data, the results rank the methods, in order of performance: LZW, arithmetic coding, Tunstall's algorithm, then BWT + RLE. They also show that the relative performance of these techniques is strongly linked to the sequencing and/or recurrence of the symbols that make up the message, as well as to the cumulative encoding and decoding time.
Keywords: arithmetic coding; BWT; compression ratio; comparative study; compression techniques; Shannon-Fano; Huffman; lossless compression; LZW; performance; redundancy; RLE; text data; Tunstall
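Of the surveyed methods, RLE is the simplest to sketch: it collapses runs of a repeated symbol into (symbol, count) pairs, which is why it does well on the repeating-pattern data the paper tests (a minimal sketch, not the authors' Matlab scripts):

```python
def rle_encode(text):
    """Run-length encoding: collapse runs of a symbol into [symbol, count]."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs

def rle_decode(runs):
    """Invert rle_encode: expand each [symbol, count] back into a run."""
    return "".join(ch * n for ch, n in runs)

data = "aaaabbbcca"
encoded = rle_encode(data)
print(encoded)
assert rle_decode(encoded) == data
```

Note the trade-off the paper observes: on text with few repeated runs, the (symbol, count) pairs can be larger than the input, which is why RLE is usually paired with BWT to bring repeated symbols together first.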
3. Clustering Text Data Streams (Cited: 7)
Authors: 刘玉葆, 蔡嘉荣, 印鉴, 傅蔚慈. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2008, No. 1, pp. 112-128.
Clustering text data streams is an important issue in the data mining community, with applications such as newsgroup filtering, text crawling, document organization, and topic detection and tracing. However, most methods are similarity-based approaches that only use the TF-IDF scheme to represent the semantics of text data, and they often lead to poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more effective than the TF-IDF scheme for improving text clustering quality, but the existing semantic smoothing model is not suitable for a dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure, named the cluster profile, which captures the semantics of text data streams dynamically while speeding up the clustering process. Efficient implementations of the algorithms are given, and a series of experimental results illustrates the effectiveness of the technique.
Keywords: clustering; database applications; data mining; text data streams
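The cluster-profile idea (a running statistics structure updated as documents stream in, so arriving documents can be assigned without revisiting the stream) can be sketched with a simple term-frequency summary. This is an illustrative toy, assuming cosine assignment with a fixed threshold; the real OCTS statistics and the semantic smoothing model are richer:

```python
import math
from collections import Counter

class ClusterProfile:
    """Toy stand-in for a cluster profile: running term frequencies."""
    def __init__(self):
        self.tf = Counter()
        self.n_docs = 0

    def add(self, tokens):
        self.tf.update(tokens)
        self.n_docs += 1

    def cosine(self, tokens):
        doc = Counter(tokens)
        dot = sum(self.tf[t] * c for t, c in doc.items())
        na = math.sqrt(sum(v * v for v in self.tf.values()))
        nb = math.sqrt(sum(v * v for v in doc.values()))
        return dot / (na * nb) if na and nb else 0.0

def stream_cluster(docs, threshold=0.3):
    """One-pass clustering: join the most similar profile, else open a new one."""
    clusters = []
    for doc in docs:
        tokens = doc.split()
        best, best_sim = None, threshold
        for c in clusters:
            sim = c.cosine(tokens)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            best = ClusterProfile()
            clusters.append(best)
        best.add(tokens)
    return clusters

clusters = stream_cluster([
    "stock market rises", "market stock falls",
    "football match tonight", "football cup final",
])
print(len(clusters))
```

Because each document only touches the per-cluster summaries, memory stays bounded by the number of clusters rather than the length of the stream.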
4. Identifying Scientific Project-generated Data Citation from Full-text Articles: An Investigation of TCGA Data Citation (Cited: 4)
Authors: Jiao Li, Si Zheng, Hongyu Kang, Zhen Hou, Qing Qian. Journal of Data and Information Science, 2016, No. 2, pp. 32-44.
Purpose: In the open science era, it is typical to share project-generated scientific data by depositing it in an open, accessible database, while scientific publications are preserved in digital library archives. It is challenging to identify the data usage mentioned in the literature and associate it with its source. Here, we investigated the data usage of a government-funded cancer genomics project, The Cancer Genome Atlas (TCGA), via a full-text literature analysis. Design/methodology/approach: We focused on identifying articles using the TCGA dataset and constructing linkages between the articles and the specific TCGA data. First, we collected 5,372 TCGA-related articles from PubMed Central (PMC). Second, we constructed a benchmark set of 25 full-text articles that truly used TCGA data in their studies and summarized its key features. Third, the key features were applied to the remaining PMC full-text articles. Findings: The number of publications using TCGA data has increased significantly since 2011, although the TCGA project was launched in 2005. The critical focus areas of the studies using TCGA data were glioblastoma multiforme, lung cancer, and breast cancer, and data from the RNA-sequencing (RNA-seq) platform was the most used. Research limitations: The current workflow to identify articles that truly used TCGA data is labor-intensive; an automatic method is expected to improve performance. Practical implications: This study will help cancer genomics researchers track the latest advancements in cancer molecular therapy, and it will promote data sharing and data-intensive scientific discovery. Originality/value: Few studies have investigated data usage by government-funded projects/programs since their launch. In this preliminary study, we extracted articles that use TCGA data from PMC and created links between the full-text articles and the source data.
Keywords: scientific data; full-text literature; open access; PubMed Central; data citation
5. Automatic User Goals Identification Based on Anchor Text and Click-Through Data (Cited: 6)
Authors: YUAN Xiaojie, DOU Zhicheng, ZHANG Lu, LIU Fang. Wuhan University Journal of Natural Sciences (CAS), 2008, No. 4, pp. 495-500.
Understanding the underlying goal behind a user's Web query has been shown to help improve the quality of search. This paper focuses on the automatic identification of query types according to those goals. Four novel entropy-based features extracted from anchor data and click-through data are proposed, and a support vector machine (SVM) classifier is used to identify the user's goal from these features. Experimental results show that the proposed entropy-based features are more effective than those reported in previous work; by combining multiple features, the goals of more than 97% of the queries studied can be correctly identified. The paper also reaches the following conclusions: first, anchor-based features are more effective than click-through-based features; second, the number of sites is more reliable than the number of links; third, click-distribution-based features are more effective than session-based ones.
Keywords: query classification; user goals; anchor text; click-through data; information retrieval
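The intuition behind an entropy-based click feature can be sketched in a few lines: a query whose clicks concentrate on one site (low entropy) looks navigational, while clicks spread over many sites (high entropy) look informational. This is an illustrative sketch, not the paper's exact feature definition:

```python
import math
from collections import Counter

def click_entropy(clicked_sites):
    """Shannon entropy of a query's click distribution over sites (bits).
    Low entropy suggests a navigational goal, high entropy an informational one."""
    counts = Counter(clicked_sites)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Navigational-looking query: almost all clicks go to one site
print(click_entropy(["nasa.gov"] * 9 + ["wikipedia.org"]))
# Informational-looking query: clicks spread evenly over many sites
print(click_entropy(["a.com", "b.org", "c.net", "d.edu"]))
```

The analogous feature over anchor data replaces clicked sites with the sites whose anchor text matches the query.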
6. A Complexity Analysis and Entropy for Different Data Compression Algorithms on Text Files (Cited: 1)
Authors: Mohammad Hjouj Btoush, Ziad E. Dawahdeh. Journal of Computer and Communications, 2018, No. 1, pp. 301-315.
In this paper, we analyze the complexity and entropy of different data compression algorithms: LZW, Huffman, fixed-length code (FLC), and Huffman after using fixed-length code (HFLC). We test these algorithms on files of different sizes and conclude that LZW is the best across all compression scales tested, especially on large files, followed by Huffman, HFLC, and FLC, respectively. Data compression remains an important research topic with many applications. We therefore suggest continuing research in this field, trying to combine two techniques to reach a better one, or using another source mapping (Hamming, such as embedding a linear array into a hypercube) together with proven techniques like Huffman.
Keywords: text files; data compression; Huffman coding; LZW; Hamming; entropy; complexity
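The entropy the paper measures bounds what any of these symbol codes can achieve: per-symbol Shannon entropy is the minimum average code length (in bits) for a lossless symbol-by-symbol code. A minimal sketch of the computation:

```python
import math
from collections import Counter

def text_entropy(text):
    """Per-symbol Shannon entropy in bits: a lower bound on the average
    code length of any lossless symbol code (Huffman, FLC, ...)."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "abababababababab"
h = text_entropy(sample)
print(f"entropy: {h:.3f} bits/symbol")
print(f"theoretical minimum: {h * len(sample):.0f} bits "
      f"(vs {8 * len(sample)} bits in 8-bit ASCII)")
```

LZW can beat this per-symbol bound on repetitive files because it codes variable-length strings of symbols, not single symbols, which matches the paper's finding that it wins on large files.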
7. A feature representation method for biomedical scientific data based on composite text description
Authors: SUN Wei. Chinese Journal of Library and Information Science, 2009, No. 4, pp. 43-53.
Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is insufficient, which to some extent affects the results of scientific data clustering. This paper therefore proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method uses different feature weighting algorithms to represent candidate features based on two types of data sources, then combines and strengthens the two feature sets. Experiments show that the method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.
Keywords: composite text description; scientific data; feature representation; weighting algorithm
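The combine-and-strengthen step can be sketched as merging two per-source feature-weight maps and boosting features both sources agree on. The merge rule and `boost` factor here are assumptions for illustration; the paper's exact weighting schemes are not given in the abstract:

```python
def combine_features(weights_a, weights_b, boost=1.5):
    """Merge two feature-weight maps from different text sources and
    strengthen features supported by both (illustrative CTD sketch)."""
    combined = {}
    for feat in set(weights_a) | set(weights_b):
        w = weights_a.get(feat, 0.0) + weights_b.get(feat, 0.0)
        if feat in weights_a and feat in weights_b:
            w *= boost  # feature appears in both descriptions
        combined[feat] = w
    return combined

title_weights = {"gene": 0.8, "expression": 0.6}
abstract_weights = {"gene": 0.5, "pathway": 0.4}
print(combine_features(title_weights, abstract_weights))
```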
8. Fast Data Processing of a Polarimeter-Interferometer System on J-TEXT
Authors: 刘煜锴, 高丽, 刘海庆, 杨曜, 高翔, J-TEXT Team. Plasma Science and Technology (SCIE, EI, CAS, CSCD), 2016, No. 12, pp. 1143-1147.
A fast data processing method has been developed to rapidly obtain the evolution of the electron density profile for a multichannel polarimeter-interferometer system (POLARIS) on J-TEXT. Compared with the Abel inversion method, the density profile evolution produced by this method quickly offers important information. The method computes in about ten milliseconds per normal shot and can process data sampled at up to 1 MHz, which is helpful for studying density sawtooth instability and disruptions between shots. During the flat-top plasma current of usual ohmic discharges on J-TEXT, the shape factor u ranges from 4 to 5; when a disruption happens, the density profile becomes peaked and u typically decreases to 1.
Keywords: fast data processing; polarimeter-interferometer; J-TEXT
9. EOS Data Dumper: an automatic download and republishing system for free EOS data (Cited: 5)
Authors: 南卓铜, 王亮绪, 李新. 《冰川冻土》 (CSCD), 2007, No. 3, pp. 463-469.
To make better use of existing data resources and avoid duplicated investment in research facilities, data sharing is receiving growing attention. NASA's Earth Observing System (EOS) provides a large amount of free data, including MODIS products. EOS Data Dumper (EDD) programmatically simulates the normal download workflow of the EOS data portal and uses Web page text capture techniques to automatically download, on schedule, all free EOS data covering a study area. It then republishes the data to the Internet through the free DIAL system, supporting complex spatio-temporal queries. This paper describes the project background and significance of EDD and its implementation in detail from a technical point of view.
Keywords: EOS data; remote sensing image data; Web text capture; data sharing
10. Semi-Supervised Learning in Large Scale Text Categorization (Cited: 1)
Authors: 许泽文, 李建强, 刘博, 毕敬, 李蓉, 毛睿. Journal of Shanghai Jiaotong University (Science) (EI), 2017, No. 3, pp. 291-302.
The rapid development of the Internet brings a variety of original information, including text and audio, but its sheer volume makes it difficult to find the most useful knowledge rapidly and accurately. Automatic text classification based on machine learning can assign large numbers of natural-language documents to the correct subject categories according to their semantics, helping users grasp text information directly. By learning from a set of hand-labeled documents, we obtain a traditional supervised classifier for text categorization (TC); however, labeling all data by hand is labor-intensive and time-consuming. Some scholars have proposed semi-supervised learning to train the classifier, but it remains unfeasible for the variety and volume of Web data, since it still needs some hand-labeled data. In 2012, Li et al. introduced a fully automatic categorization approach for text (FACT) based on supervised learning, requiring no manual labeling; but automatically labeling all data introduces noise, and the results cannot meet accuracy requirements. We put forward a new idea: part of the data can be tagged automatically with high accuracy based on the semantics of the category name, and a classifier is then trained in a semi-supervised way with both labeled and unlabeled data, ultimately achieving precise classification of massive text data. Empirical experiments show that the method outperforms a supervised support vector machine (SVM) in both F1 and classification accuracy in most cases, demonstrating the effectiveness of the semi-supervised algorithm for automatic TC.
Keywords: text data mining; semi-supervised; automatic tagging; classifier
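The core loop (seed-label only documents that mention a category name, then propagate labels to the rest) can be sketched with a toy nearest-centroid classifier. The seeding rule and overlap scoring here are simplifications assumed for illustration, not the paper's models:

```python
def seed_label(doc, category_names):
    """Auto-label a document only when a category name occurs in it:
    the high-confidence seeding step (simplified)."""
    for cat in category_names:
        if cat in doc:
            return cat
    return None

def centroid_classify(doc, centroids):
    """Assign by word overlap with each category's accumulated vocabulary."""
    words = set(doc.split())
    return max(centroids, key=lambda c: len(words & centroids[c]))

def semi_supervised_train(docs, category_names):
    labeled = [(d, seed_label(d, category_names)) for d in docs]
    centroids = {c: set() for c in category_names}
    for doc, lab in labeled:
        if lab:
            centroids[lab].update(doc.split())
    # propagate labels to the unlabeled remainder
    return [(d, lab or centroid_classify(d, centroids)) for d, lab in labeled]

docs = [
    "sports news about the match",
    "the match ended in a draw",
    "politics debate in parliament",
    "debate in parliament about the bill",
]
print(semi_supervised_train(docs, ["sports", "politics"]))
```

In the paper's setting the propagation step is a real semi-supervised classifier trained on both the seed-labeled and unlabeled documents, rather than a single overlap pass.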
11. A Chinese medical short-text classification model fusing multi-level semantics
Authors: 杨杰, 刘纳, 郑国风, 李晨, 道路. 《郑州大学学报(理学版)》, 2026, No. 1, pp. 51-57.
To address the insufficient extraction of key semantic information and reduced model robustness in medical short-text classification, a text classification model fusing multi-level semantic information is proposed. First, a pre-trained model captures preliminary semantic features of the text. Second, a capsule network extracts key semantic information, ensuring the model effectively learns the core semantics of short texts, while attention pooling focuses on document-level information, enhancing the recognition and understanding of medical terminology and concepts. Finally, an adversarial training strategy improves the model's stability and accuracy on ambiguous expressions and perturbed inputs. The model is validated on three medical text classification datasets (CHIP-CTC, KUAKE_QIC, and VSQ); compared with existing models, it improves the F1 score on all three datasets, significantly enhancing Chinese medical short-text classification performance.
Keywords: Chinese medical data; short-text classification; semantic fusion; capsule network; attention pooling
12. A Comparative Study on Two Techniques of Reducing the Dimension of Text Feature Space
Authors: Yin Zhonghang, Wang Yongcheng, Cai Wei, Diao Qian (School of Electronic & Information Technology, Shanghai Jiaotong University, Shanghai 200030, P.R. China). Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2002, No. 1, pp. 87-92.
With the development of large-scale text processing, the dimension of the text feature space has grown larger and larger, adding many difficulties to natural language processing. How to reduce this dimension has become a practical problem in the field. We present two clustering methods, concept association and concept abstraction, to achieve this goal. The first refers to keyword clustering based on co-occurrence in the same text, and the second to co-occurrence in the same category. We then compare the difference between them. Our experimental results show that both are efficient at reducing the dimension of the text feature space.
Keywords: text data mining
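The concept-association idea (group keywords that frequently co-occur in the same text, then treat each group as one feature dimension) can be sketched by counting same-document keyword pairs. An illustrative sketch under assumed whitespace tokenization, not the authors' implementation:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(documents, min_count=2):
    """Count keyword pairs appearing in the same text; pairs past the
    threshold are candidates to merge into one 'concept' dimension."""
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(doc.split()))  # each pair counted once per doc
        pair_counts.update(combinations(terms, 2))
    return {p: c for p, c in pair_counts.items() if c >= min_count}

docs = [
    "neural network training",
    "neural network inference",
    "database index tuning",
]
print(cooccurrence_pairs(docs))
```

Concept abstraction works the same way but counts co-occurrence within the same category rather than the same text.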
13. Data mining analysis of the medication patterns of ancient and modern physicians in treating diabetic kidney disease
Authors: 张梦莹, 徐浩凡, 王文佳, 周水平, 胡蕴慧. 《新中医》, 2026, No. 1, pp. 8-17.
Objective: To analyze, through data mining, the medication patterns of ancient physicians in treating diabetic kidney disease, and to verify the patterns against real-world data and network proximity analysis. Methods: Herbal formulas for the oral treatment of diabetic kidney disease were retrieved from the 5th edition of the Zhonghua Yidian (Chinese Medical Classics). The TCM Inheritance Support Platform was used to tabulate the nature, flavor, and meridian tropism of the herbs and their usage frequency, and to analyze association rules, core herb combinations, and high-frequency combinations. Real-world data and network proximity analysis were then used to compare the classical herb properties, usage frequencies, and core combinations with modern practice. Results: 376 formulas involving 310 herbs were included from the Zhonghua Yidian. Cold-natured herbs were most frequent, followed by warm; sweet flavor was most frequent, followed by bitter; the kidney meridian dominated. The five most frequently used single herbs were maidong (Ophiopogon), renshen (ginseng), fuling (Poria), gancao (licorice), and gualou (Trichosanthes). Association-rule analysis yielded 124 herb combinations with support >= 20 and 15 combinations with confidence >= 0.9. The real-world dataset covered 30,972 patient visits involving 444 herbs; cold nature was most frequent, followed by warm; bitter flavor was most frequent, followed by sweet; the kidney meridian dominated, followed by the liver meridian. The five most frequent classical herbs each accounted for less than 5% of the real-world ranking. Cluster analysis yielded 6 core herb combinations, each with a network-proximity Z-score below -4 with respect to diabetic kidney disease. Conclusion: The etiology of diabetic kidney disease is mainly liver-kidney deficiency, with imbalance of the spleen, lung, and heart as secondary. Ancient physicians not only emphasized tonifying deficiency to support the body, but also often combined it with clearing heat and purging fire, assisted by draining dampness and astringing essence; in modifying formulas, they particularly valued Jingui Shenqi Pills and Liuwei Dihuang Pills.
Keywords: diabetic kidney disease; TCM classics; data mining; medication patterns; real-world study; network proximity
14. A brief discussion on using the image and text data types in SQL (Cited: 1)
Authors: 陈晓男. 《电脑知识与技术》, 2006, No. 5, pp. 123-124.
The image and text data types in SQL bring users a great deal of convenience, but their use often raises problems in practice. These can be solved with two command-prompt tools, bcp and textcopy.
Keywords: SQL; data; image; text; bcp; textcopy
15. Chinese science and technology policy text classification: an enhanced TextCNN perspective (Cited: 8)
Authors: 李牧南, 王良, 赖华鹏. 《科技管理研究》 (CSSCI), 2023, No. 2, pp. 160-166.
Although there has been considerable research on Chinese text classification in recent years, studies using deep learning to automatically classify long Chinese texts such as policies remain rare. Drawing on and extending traditional data augmentation methods, this paper proposes NEWT, a computational framework integrating the New Era People's Daily segmentation corpus (NEPD), the easy data augmentation (EDA) algorithm, word2vec, and a text convolutional neural network (TextCNN). In the empirical part, the algorithm is validated on science and technology policy texts issued by Chinese local governments. With input lengths of 500, 750, and 1,000 words, NEWT outperforms traditional deep learning models such as RCNN, Bi-LSTM, and CapsNet, raising the F1 score by more than 13% on average. At shorter input lengths, NEWT approximates the effect of full-text input, partially improving the computational efficiency of traditional deep learning models in the automatic classification of long Chinese texts.
Keywords: NEWT; deep learning; data augmentation; convolutional neural network; policy text classification; long Chinese text
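One of the four EDA operations the framework builds on, random swap, can be sketched in a few lines (the other three are synonym replacement, random insertion, and random deletion); this is a generic EDA sketch, not the NEWT code:

```python
import random

def eda_random_swap(tokens, n_swaps=2):
    """EDA random swap: exchange two random positions n_swaps times to
    create a slightly perturbed copy of a training sample."""
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

random.seed(1)
print(eda_random_swap("促进 科技 成果 转化 的 政策".split()))
```

Each augmented copy keeps the original label, letting a small policy-text corpus be expanded before TextCNN training.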
16. Design of the data acquisition system for the J-TEXT tokamak (Cited: 2)
Authors: 黄礼华, 庄革, 张明, 杨州军. 《微计算机信息》, 2009, No. 16, pp. 74-76.
To meet the needs of experimental data acquisition, storage, access, and processing on the J-TEXT device, a data acquisition and service system was designed and developed. The system adopts a client/server (C/S) architecture and consists of three parts, data acquisition, data storage, and data service, connected through high-speed Ethernet. This paper describes the network structure, the data acquisition workflow, and the related data services in detail. In J-TEXT discharge experiments, the system has proved stable and reliable with flexible access, meeting the needs of the device.
Keywords: J-TEXT; data acquisition; MDSplus
17. The data acquisition and data service system of the J-TEXT tokamak (Cited: 1)
Authors: 冯泽龙, 庄革, 瞿连政, 丁永华, 张明, 黄礼华. 《船电技术》, 2007, No. 2, pp. 65-68.
The J-TEXT tokamak is an experimental device for thermonuclear fusion research. Its data system, an important component of the device, consists of data acquisition and data service. The acquisition part uses PCI-bus acquisition cards, with acquisition programs written in LabVIEW running on Windows. The data service uses the MDSplus software package on Red Hat Enterprise Linux. This paper describes the data flow and implementation of the system.
Keywords: MDSplus; LabVIEW; J-TEXT; data acquisition; data service
18. A fastText-based text classification method for earthquake information (Cited: 1)
Authors: 王钟浩, 崔珂玮, 张鑫, 杨振中, 刘帅. 《现代信息科技》, 2021, No. 3, pp. 5-8.
After an earthquake, news of many kinds appears, and earthquake-related news is hard to obtain accurately. This paper proposes a method that collects earthquake information from the Internet and identifies whether a given text is earthquake-related. Python crawlers collect data from news websites with different structures, and a fastText-based text classification model is trained on the collected data. Experimental results show that the method can effectively classify news and retrieve the desired earthquake news.
Keywords: deep learning; text classification; data acquisition; natural language processing
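fastText's supervised mode trains from a plain-text file where each line is `__label__<name>` followed by the text. A minimal sketch of preparing crawled news in that format (the sample labels and file name are assumptions; training itself would use the `fasttext` package, e.g. `fasttext.train_supervised(input="train.txt")`):

```python
def write_fasttext_file(samples, path):
    """Write (label, text) pairs in fastText's supervised input format:
    one '__label__<name> <text>' line per sample."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in samples:
            f.write(f"__label__{label} {text}\n")

samples = [
    ("quake", "M5.2 earthquake strikes near the coast"),
    ("other", "local council approves new park budget"),
]
write_fasttext_file(samples, "train.txt")
print(open("train.txt", encoding="utf-8").read())
```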
19. A New Feature Selection Method for Text Clustering (Cited: 3)
Authors: XU Junling, XU Baowen, ZHANG Weifeng, CUI Zifeng, ZHANG Wei. Wuhan University Journal of Natural Sciences (CAS), 2007, No. 5, pp. 912-916.
Feature selection methods have been successfully applied to text categorization but seldom to text clustering, due to the unavailability of class label information. In this paper, a new feature selection method for text clustering based on expectation maximization and cluster validity is proposed. It applies a supervised feature selection method to the intermediate clustering results generated during iterative clustering; meanwhile, the Davies-Bouldin index is used to evaluate the intermediate feature subsets indirectly. Feature subsets are then selected according to the curve of the Davies-Bouldin index. Experiments on several popular datasets show the advantages of the proposed method.
Keywords: feature selection; text clustering; unsupervised learning; data preprocessing
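The Davies-Bouldin index used as the validity criterion is the average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid distance; lower values mean tighter, better-separated clusters. A minimal sketch over 2-D points (the paper applies it to documents in a feature subspace):

```python
import math

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of point clusters; lower is better."""
    def centroid(pts):
        return [sum(c) / len(pts) for c in zip(*pts)]
    cents = [centroid(c) for c in clusters]
    # average distance of each cluster's points to its centroid
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

Scoring each intermediate feature subset this way lets the method pick the subset at the best point on the index curve without any class labels.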
20. A software system anomaly detection method based on log information and CNN-text (Cited: 41)
Authors: 梅御东, 陈旭, 孙毓忠, 牛逸翔, 肖立, 王海荣, 冯百明. 《计算机学报》 (EI, CSCD), 2020, No. 2, pp. 366-380.
Data mining, with its timeliness and fidelity, plays an increasingly important role today; its ability to rapidly mine patterns and discover regularities in large data is gradually replacing manual work. Large distributed systems widely used across computing (such as Hadoop and Spark) produce millions of system log entries per day, and the volume and tangled relationships of these logs have greatly reduced the efficiency of programmers' manual monitoring while raising the training cost of new programmers. Combining data mining with system analysis is therefore a natural trend, and machine learning models are increasingly applied to system log analysis. In most cases, however, log entries reporting a "severe" system state are a small minority, yet they are precisely what programmers most need to watch; most machine learning models used for log analysis assume balanced training data, so they tend to over-favor the majority class when raising alerts, with unsatisfactory results. From a deep learning perspective, this paper investigates the applicability of CNN-text (CT) to system log analysis. CT is compared with mainstream log-analysis machine learning models (SVM, decision tree) to examine its advantages over these algorithms; it is compared with CNN-RNN-text (CRT) to analyze how CT handles features and to confirm its superiority among deep models for processing system log text; finally, all models are applied to two different log text datasets to demonstrate CT's generality. In the comparison with mainstream models, CT improves recall by nearly 15% over the best of them. In the comparison with the more elaborate CRT, CT's accuracy is about 20% higher, its recall about 80% higher, and its precision about 60% higher. In the generality experiment, on the logstash dataset and the public WC85_1 dataset, with accuracy tied at 100% with the other strong models, CT's recall exceeds that of the next-best model (DT-Bi) by nearly 14%. Compared with mainstream models such as SVM, decision tree, and naive Bayes, CNN-text shows better local feature extraction and nonlinear fitting; and unlike CNN-RNN-text, which invests heavily in the sequential features of logs, CNN-text pays them less attention and thus performs better on system logs with irregular sequences. CNN-text is therefore the most suitable of the examined methods for software system anomaly detection.
Keywords: system log analysis; system anomaly alerting; imbalanced data; machine learning; deep learning; CNN-text