Journal Literature
1,000 articles found
Semi-Supervised Learning in Large Scale Text Categorization
1
Authors: 许泽文 李建强 刘博 毕敬 李蓉 毛睿 《Journal of Shanghai Jiaotong University (Science)》 EI 2017, Issue 3, pp. 291-302 (12 pages)
The rapid development of the Internet brings a variety of original information, including text and audio. However, the sheer volume makes it difficult to find the most useful knowledge rapidly and accurately. Automatic text classification based on machine learning can classify a large number of natural language documents into the corresponding subject categories according to their semantics, which helps readers grasp text information directly. A traditional supervised classifier for text categorization (TC) is obtained by learning from a set of hand-labeled documents. However, labeling all data by hand is labor-intensive and time-consuming. To solve this problem, some scholars proposed semi-supervised learning methods to train the classifier, but these are infeasible for the variety and volume of Web data since they still need some hand-labeled data. In 2012, Li et al. invented a fully automatic categorization approach for text (FACT) based on supervised learning, where no manual labeling effort is required. But automatically labeling all data can introduce noise, so that the result cannot meet the accuracy requirement. We put forward a new idea: part of the data can be automatically tagged with high accuracy based on the semantics of the category name; a semi-supervised method is then used to train the classifier with both labeled and unlabeled data, and ultimately a precise classification of massive text data can be achieved. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 performance and classification accuracy in most cases, which supports the effectiveness of the semi-supervised algorithm in automatic TC.
Keywords: text data mining semi-supervised automatic tagging classifier
Text categorization based on fuzzy classification rules tree (Cited by: 2)
2
Authors: 郭玉琴 袁方 刘海博 《Journal of Southeast University (English Edition)》 EI CAS 2008, Issue 3, pp. 339-342 (4 pages)
To address the low efficiency of the conventional fuzzy class-association method, which scans the classifier repeatedly to classify new texts, a new approach to text categorization based on the FCR-tree (fuzzy classification rules tree) is proposed. The compactness of the FCR-tree saves significant space when storing a large set of rules in which many words are repeated. In contrast to classification rules, fuzzy classification rules contain not only words but also the fuzzy sets corresponding to the frequencies of words appearing in texts. Therefore, the construction and structure of an FCR-tree differ from those of a CR-tree. To reduce the difficulty of FCR-tree construction and rule retrieval, multiple k-FCR-trees are built. When classifying a new text, it is not necessary to search the paths of the sub-trees led by words that do not appear in the text, which reduces the number of rules traversed. Experimental results show that the proposed approach clearly outperforms the conventional method in efficiency.
Keywords: text categorization fuzzy classification association rule classification rules tree fuzzy classification rules tree
A New Approach of Feature Selection for Text Categorization (Cited by: 6)
3
Authors: CUI Zifeng XU Baowen ZHANG Weifeng XU Junling 《Wuhan University Journal of Natural Sciences》 CAS 2006, Issue 5, pp. 1335-1339 (5 pages)
This paper proposes a new approach to feature selection for text categorization based on an independence measure between features. A fundamental hypothesis widely used in probabilistic models for text categorization (TC), that the occurrences of terms in documents are independent of each other, is discussed. However, the basic hypothesis is incomplete with respect to the independence of the feature set. From the viewpoint of feature selection, a new independence measure between features is designed, and a feature selection algorithm based on it is given to obtain a feature subset. The selected subset is highly relevant to the category and strongly independent between features, and so satisfies the basic hypothesis to the maximum degree. Compared with traditional feature selection methods in TC (which take only relevance into account), the feature subset selected by our method performs better in experiments on the 20 Newsgroups benchmark dataset.
Keywords: feature selection independency CHI square test text categorization
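The idea in the abstract above (prefer features that are relevant to the category yet independent of one another) can be illustrated with a small Python sketch. It is not the paper's independence measure, which the abstract does not define. As an assumption, the sketch uses the ordinary chi-square statistic of a 2x2 contingency table both for term-category relevance and for term-term dependence; the names `chi_square`, `select_features`, and `max_dependence` are hypothetical.

```python
def chi_square(docs_a, docs_b, n_docs):
    """Chi-square statistic of the 2x2 table built from two sets of document ids."""
    a = len(docs_a & docs_b)   # documents in both sets
    b = len(docs_a) - a        # documents in A only
    c = len(docs_b) - a        # documents in B only
    d = n_docs - a - b - c     # documents in neither set
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0             # degenerate table: no evidence of dependence
    return n_docs * (a * d - b * c) ** 2 / denom


def select_features(term_docs, class_docs, n_docs, k, max_dependence):
    """Greedy selection: rank terms by relevance to the class, then skip any
    term that is too dependent on a term that was already selected.

    Assumption: dependence between two terms is measured with the same
    chi-square statistic; the paper uses its own measure.
    """
    ranked = sorted(
        term_docs,
        key=lambda t: chi_square(term_docs[t], class_docs, n_docs),
        reverse=True,
    )
    selected = []
    for term in ranked:
        if all(
            chi_square(term_docs[term], term_docs[s], n_docs) <= max_dependence
            for s in selected
        ):
            selected.append(term)
        if len(selected) == k:
            break
    return selected


# Toy corpus: 8 documents, documents 0-3 belong to the target class.
term_docs = {
    "goal": {0, 1, 2, 3},
    "match": {0, 1, 2, 3},   # redundant with "goal"
    "team": {0, 1, 4},
    "the": set(range(8)),    # occurs everywhere, carries no information
}
chosen = select_features(term_docs, {0, 1, 2, 3}, n_docs=8, k=2, max_dependence=3.84)
print(chosen)
```

A relevance-only ranking would keep both "goal" and "match" here; the dependence filter drops the redundant term. The threshold 3.84 is the usual 5% critical value for one degree of freedom and is only a placeholder.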
Comparison of Text Categorization Algorithms (Cited by: 4)
4
Authors: SHI Yong-feng ZHAO Yan-ping 《Wuhan University Journal of Natural Sciences》 EI CAS 2004, Issue 5, pp. 798-804 (7 pages)
This paper summarizes several automatic text categorization algorithms in common use, and analyzes and compares their advantages and disadvantages. It provides guidance on choosing appropriate automatic classification algorithms for different fields. Finally, some evaluations and summaries of these algorithms are discussed, and directions for further research are pointed out. CLC number: TP 391. Foundation item: Supported by the National Natural Science Foundation of China (70031010) and the Research Foundation of Beijing Institute of Technology. Biography: SHI Yong-feng (1980-), male, Master's candidate; research direction: web information mining.
Keywords: text categorization naive bayes KNN SVM neural network
A Two-Stage Feature Selection Method for Text Categorization by Using Category Correlation Degree and Latent Semantic Indexing (Cited by: 2)
5
Authors: 王飞 李彩虹 王景山 徐娇 李廉 《Journal of Shanghai Jiaotong University (Science)》 EI 2015, Issue 1, pp. 44-50 (7 pages)
To improve the accuracy of text categorization and reduce the dimension of the feature space, this paper proposes a two-stage feature selection method based on a novel category correlation degree (CCD) method and latent semantic indexing (LSI). In the first stage, a novel CCD method is proposed to select the most effective features for text classification, which is more effective than traditional feature selection methods. In the second stage, document representation requires a high-dimensional feature space and does not take into account the semantic relations between features, which leads to poor categorization accuracy. The LSI method is therefore used to solve these problems: it replaces individual terms with statistically derived conceptual indices, which can discover the important correlations between features and reduce the dimension of the feature space. First, each feature is ranked by its importance for classification using the CCD method. Second, a new semantic space is constructed among the features based on the LSI method. The experimental results show that our method can effectively reduce the dimension of the text vector and improve the performance of text categorization.
Keywords: text categorization feature selection latent semantic indexing (LSI) category correlation degree (CCD)
Lazy learner text categorization algorithm based on embedded feature selection (Cited by: 1)
6
Authors: Yan Peng Zheng Xuefeng Zhu Jianyong Xiao Yunhong 《Journal of Systems Engineering and Electronics》 SCIE EI CSCD 2009, Issue 3, pp. 651-659 (9 pages)
To avoid the curse of dimensionality, text categorization (TC) algorithms based on machine learning (ML) have to use a feature selection (FS) method to reduce the dimensionality of the feature space. Although widely used, the FS process generally causes information loss, which has considerable side effects on the overall performance of TC algorithms. On the basis of the sparsity of text vectors, a new TC algorithm based on lazy feature selection (LFS) is presented. As a new type of embedded feature selection approach, the LFS method can greatly reduce the dimension of features without any information loss, which can greatly improve both the efficiency and the performance of algorithms. The experiments show that the new algorithm achieves both much higher performance and much higher efficiency than some other classical TC algorithms.
Keywords: machine learning text categorization embedded feature selection lazy learner cosine similarity
A Text Categorization System with Soft Real-Time Guarantee (Cited by: 1)
7
Authors: WANG Hua-yong CHEN Yu DAI Yi-qi 《Wuhan University Journal of Natural Sciences》 EI CAS 2006, Issue 1, pp. 226-229 (4 pages)
To provide predictable runtime performance for text categorization (TC) systems, an innovative system design method is proposed for soft real-time TC systems. An analyzable mathematical model is established to approximately describe the nonlinear and time-varying TC systems. Based on this mathematical model, feedback control theory is adopted to prove the system's stability and zero steady-state error. The experimental results show that the error of the deadline satisfied ratio in the system is kept within 4 of the desired value, and that the number of classifiers can be dynamically adjusted by the system itself to save computation resources. The proposed methodology enables theoretical analysis and evaluation of TC systems, leading to a high-quality and low-cost implementation approach.
Keywords: information retrieval text categorization soft real-time system feedback control theory
A formal study of feature selection in text categorization (Cited by: 15)
8
Authors: XU Yan 《通讯和计算机(中英文版)》 2009, Issue 4, pp. 32-41 (10 pages)
Keywords: feature classification; constraints; text classification; information
Analysis of Semi-Supervised Text Clustering Algorithm on Marine Data
9
Authors: Yu Jiang Dengwen Yu Mingzhao Zhao Hongtao Bai Chong Wang Lili He 《Computers, Materials & Continua》 SCIE EI 2020, Issue 7, pp. 207-216 (10 pages)
Semi-supervised clustering improves learning performance by using a small number of labeled samples to guide the learning of unlabeled samples. This paper implements and compares unsupervised and semi-supervised clustering analysis of BOA-Argo ocean text data. Unsupervised K-Means and Affinity Propagation (AP) are two classical clustering algorithms. The Election-AP algorithm is proposed to handle the final cluster number in AP clustering, which has proved difficult to keep within a suitable range. The labeled samples for semi-supervised learning are thermocline data selected from the BOA-Argo dataset according to the standard definition of the thermocline, and these data are used for semi-supervised cluster analysis. Several semi-supervised clustering algorithms were chosen for comparison of learning performance: Constrained-K-Means, Seeded-K-Means, SAP (Semi-supervised Affinity Propagation), LSAP (Loose Seed AP), and CSAP (Compact Seed AP). To accommodate single-label data, this paper adapts the above algorithms into SCKM (improved Constrained-K-Means), SSKM (improved Seeded-K-Means), and SSAP (improved Semi-supervised Affinity Propagation) to perform semi-supervised clustering analysis on the data. A DSAP (Double Seed AP) semi-supervised clustering algorithm based on compact seeds is proposed, and the experimental data show that DSAP has a better clustering effect. The unsupervised and semi-supervised clustering results are used to analyze the potential patterns of marine data.
Keywords: Unsupervised learning semi-supervised learning text clustering
The Role of Rare Terms in Enhancing the Performance of Polynomial Networks Based Text Categorization
10
Authors: Mayy M. Al-Tahrawi 《Journal of Intelligent Learning Systems and Applications》 2013, Issue 2, pp. 84-89 (6 pages)
In this paper, the role of rare or infrequent terms in enhancing the accuracy of English text categorization using Polynomial Networks (PNs) is investigated. To study the impact of rare terms on the accuracy of PN-based text categorization, different term reduction criteria and different term weighting schemes were tested on the Reuters corpus using PNs. Each term weighting scheme on each reduced term set was tested twice: once keeping the rare terms and once removing them. All the experiments conducted in this research show that keeping rare terms substantially improves the performance of Polynomial Networks in text categorization, regardless of the term reduction method, the number of terms used in classification, or the term weighting scheme adopted.
Keywords: Polynomial Networks Text Categorization Document Classification Infrequent Terms Rare Terms
Smart Approaches to Efficient Text Mining for Categorizing Sexual Reproductive Health Short Messages into Key Themes
11
Authors: Tobias Makai Mayumbo Nyirenda 《Open Journal of Applied Sciences》 2024, Issue 2, pp. 511-532 (22 pages)
To promote behavioral change among adolescents in Zambia, the National HIV/AIDS/STI/TB Council, in collaboration with UNICEF, developed the Zambia U-Report platform. This platform provides young people with improved access to information on various Sexual Reproductive Health topics through Short Messaging Service (SMS) messages. Over the years, the platform has accumulated millions of incoming and outgoing messages, which need to be categorized into key thematic areas for better tracking of sexual reproductive health knowledge gaps among young people. The current manual categorization process for these text messages is inefficient and time-consuming, and this study aims to automate the process for improved analysis using text-mining techniques. Firstly, the study investigates the current text message categorization process and identifies a list of categories adopted by counselors over time, which are then used to build and train a categorization model. Secondly, the study presents a proof-of-concept tool that automates the categorization of U-Report messages into key thematic areas using the developed categorization model. Finally, it compares the performance and effectiveness of the developed proof-of-concept tool against the manual system. The study used a dataset comprising 206,625 text messages. The current process would take roughly 2.82 years to categorise this dataset, whereas the trained SVM model would require only 6.4 minutes while achieving an accuracy of 70.4%, demonstrating that the automated method is significantly faster, more scalable, and more consistent than the current manual categorization. These advantages make the SVM model a more efficient and effective tool for categorizing large unstructured text datasets.
These results and the proof-of-concept tool developed demonstrate the potential for enhancing the efficiency and accuracy of message categorization on the Zambia U-Report platform and other similar text-message-based platforms.
Keywords: Knowledge Discovery in Text (KDT) Sexual Reproductive Health (SRH) Text Categorization Text Classification Text Extraction Text Mining Feature Extraction Automated Classification Process Performance Stemming and Lemmatization Natural Language Processing (NLP)
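Uyghur Keyword Extraction and Text Classification Based on the TextRank Algorithm and Mutual Information Similarity (Cited by: 9)
12
Authors: 阿力甫.阿不都克里木 李晓 《计算机科学》 CSCD PKU Core 2016, Issue 12, pp. 36-40 (5 pages)
To address the classification of Uyghur text, a method for Uyghur keyword extraction and text classification based on the TextRank algorithm and mutual information similarity is proposed. First, the input text is preprocessed to filter out non-Uyghur characters and stop words. Then, a TextRank algorithm weighted by word semantic similarity, word position, and word frequency importance is used to extract the keyword set of the text. Finally, using a mutual information similarity measure, the similarity between the keyword set of the input text and the keyword set of each class is computed to classify the text. Experimental results show that the scheme can extract highly discriminative keywords; when the keyword set size is 1250, the average classification rate reaches 91.2%.
Keywords: Uyghur; text classification; keyword extraction; TextRank algorithm; mutual information similarity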
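The keyword-extraction step in the abstract above builds on TextRank. The Python sketch below shows only the plain, unweighted-by-semantics variant on a co-occurrence graph, as a rough illustration. The semantic-similarity, position, and frequency weighting described in the abstract is not reproduced, and the function name `textrank_keywords` and its parameters are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with plain TextRank on an undirected co-occurrence graph.

    Assumption: `tokens` is already preprocessed (stop words and foreign
    characters removed). Edge weights count co-occurrences within `window`.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            # Each neighbour passes on its score in proportion to the edge weight.
            rank = sum(
                weights[nb][word] / sum(weights[nb].values()) * scores[nb]
                for nb in weights[word]
            )
            new_scores[word] = (1.0 - damping) + damping * rank
        scores = new_scores
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


tokens = ["text", "classification", "text", "keyword", "text", "similarity"]
top = textrank_keywords(tokens, window=1, top_k=1)
print(top)
```

In this toy input every other word co-occurs only with "text", so it is the hub of the graph and receives the highest score. A classifier in the spirit of the abstract would then compare such keyword sets against per-class keyword sets.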
A Novel Active Learning Method Using SVM for Text Classification (Cited by: 26)
13
Authors: Mohamed Goudjil Mouloud Koudil Mouldi Bedda Noureddine Ghoggali 《International Journal of Automation and Computing》 EI CSCD 2018, Issue 3, pp. 290-298 (9 pages)
Support vector machines (SVMs) are a popular class of supervised learning algorithms, and are particularly applicable to large and high-dimensional classification problems. Like most machine learning methods for data classification and information retrieval, they require manually labeled data samples in the training stage. However, manual labeling is a time-consuming and error-prone task. One possible solution to this issue is to exploit the large number of unlabeled samples that are easily accessible via the Internet. This paper presents a novel active learning method for text categorization. The main objective of active learning is to reduce the labeling effort, without compromising the accuracy of classification, by intelligently selecting which samples should be labeled. The proposed method selects a batch of informative samples using the posterior probabilities provided by a set of multi-class SVM classifiers, and these samples are then manually labeled by an expert. Experimental results indicate that the proposed active learning method significantly reduces the labeling effort, while simultaneously enhancing the classification accuracy.
Keywords: text categorization active learning support vector machine (SVM) pool-based active learning pairwise coupling
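The batch-selection step described in the abstract above can be sketched in Python as follows. The abstract does not state the exact selection criterion, so this sketch assumes a common one, the margin between the two largest posterior probabilities: samples with the smallest margin are treated as the most informative. The name `select_batch` is hypothetical, and the posteriors are supplied directly rather than computed by an SVM.

```python
def select_batch(posteriors, batch_size):
    """Return indices of the `batch_size` most uncertain samples.

    Assumption: uncertainty is the gap between the two highest class
    posteriors (smaller gap means more informative). Each row of
    `posteriors` must contain at least two class probabilities.
    """
    margins = []
    for index, probs in enumerate(posteriors):
        first, second = sorted(probs, reverse=True)[:2]
        margins.append((first - second, index))
    margins.sort()  # smallest margin first
    return [index for _, index in margins[:batch_size]]


# Posterior probabilities for four unlabeled documents over three classes.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],  # almost uniform
]
to_label = select_batch(pool, batch_size=2)
print(to_label)
```

In a pool-based loop, the selected documents would be sent to an expert for labeling, added to the training set, and the classifiers retrained before the next round.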
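Research on the Application of TextCNN Text Classification Technology in OA Systems (Cited by: 3)
14
Authors: 皎海军 廖晨阳 杜胜贤 于劲松 《办公自动化》 2020, Issue 14, pp. 45-48 (4 pages)
With the development of big data, traditional office software is seeing new development trends. This paper introduces the TextCNN deep learning network into a fully electronic system for government convenience services, and studies how to integrate text classification techniques from natural language processing with a collaborative OA system in order to provide a distribution recommendation service for government documents. Following the principle of assisting without interfering, the computer's intelligent decision results are clearly fed back to the document dispatcher to assist in making the final judgment. The service addresses staff shortages in government departments and the high error rate in document distribution, effectively reduces the return rate, and speeds up document circulation.
Keywords: TextCNN; collaborative OA; natural language processing (NLP); text classification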
A Quantum Spatial Graph Convolutional Network for Text Classification (Cited by: 3)
15
Authors: Syed Mustajar Ahmad Shah Hongwei Ge Sami Ahmed Haider Muhammad Irshad Sohail M. Noman Jehangir Arshad Asfandeyar Ahmad Talha Younas 《Computer Systems Science & Engineering》 SCIE EI 2021, Issue 2, pp. 369-382 (14 pages)
Data generated from non-Euclidean domains, and applications of its graph representation (with complex relationships and interdependence between objects), have grown exponentially. The sophistication of graph data poses considerable obstacles to existing machine learning algorithms. In this study, we consider a revamped version of a semi-supervised learning algorithm for graph-structured data to address the issue of extending deep learning approaches to represent graph data. Additionally, quantum information theory is applied through Graph Neural Networks (GNNs) to generate Riemannian metrics in closed form for several graph layers. Furthermore, to pre-process the adjacency matrix of graphs, a new formulation is established to incorporate high-order proximities. The proposed scheme shows marked improvements in overcoming the deficiencies of the Graph Convolutional Network (GCN), particularly information loss and imprecise information representation, with acceptable computational overhead. Moreover, the proposed Quantum Graph Convolutional Network (QGCN) significantly strengthens the GCN on semi-supervised node classification tasks. In parallel, it significantly improves generalization by making small random perturbations of the graph during training. Evaluation results are provided on three benchmark datasets, Citeseer, Cora, and PubMed, and show the superiority of the proposed model in terms of computational accuracy over the state-of-the-art GCN and three other methods based on the same algorithms in the existing literature.
Keywords: text classification deep learning graph convolutional networks semi-supervised learning GPUs performance improvements
Automatic Classification of Unstructured Blog Text (Cited by: 1)
16
Authors: Mita K. Dalal Mukesh A. Zaveri 《Journal of Intelligent Learning Systems and Applications》 2013, Issue 2, pp. 108-114 (7 pages)
Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which the blog entries are automatically assigned to one of a set of pre-defined classes based on the features extracted from their textual content. This paper attempts automatic classification of unstructured blog entries through pre-processing steps such as tokenization, stop-word elimination, and stemming; statistical techniques for feature set extraction; and feature set enhancement using semantic resources, followed by modeling with two alternative machine learning models: the naïve Bayesian model and the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step classification approach achieves good overall classification accuracy on unstructured blog text datasets with both machine learning models. However, the naïve Bayesian classification model clearly outperforms the ANN-based classification model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the number of available training datasets is restricted.
Keywords: Automatic Blog Text Classification Feature Extraction Machine Learning Models Semi-Supervised Learning
A fuzzy method to learn text classifier from labeled and unlabeled examples
17
Authors: 刘宏 黄上腾 《Journal of Harbin Institute of Technology (New Series)》 EI CAS 2004, Issue 1, pp. 98-102 (5 pages)
In text classification, labeling documents is a tedious and costly task, as it consumes a lot of expert time. On the other hand, it is usually easier to obtain many unlabeled documents with the help of tools such as digital libraries, crawler programs, and search engines. To learn a text classifier from labeled and unlabeled examples, a novel fuzzy method is proposed. First, a Seeded Fuzzy c-means Clustering algorithm is proposed to learn fuzzy clusters from a set of labeled and unlabeled examples. Second, based on the resulting fuzzy clusters, examples with high confidence are selected to construct a training data set. Finally, the constructed training data set is used to train a Fuzzy Support Vector Machine (FSVM) and obtain the text classifier. Empirical results on two benchmark datasets indicate that, by incorporating unlabeled examples into the learning process, the method performs significantly better than an FSVM trained with only a small number of labeled examples. The proposed method also performs at least as well as the related method, EM with Naive Bayes. One advantage of the proposed method is that it does not rely on any parametric assumptions about the data, as is usually the case with the generative methods widely used in semi-supervised learning.
Keywords: text categorization fuzzy clustering
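Research on a Quantum Neural Network Based on Angle-Amplitude Hybrid Encoding and Its Applications
18
Authors: 杨帆 程学云 朱鹏程 姜一博 顾晖 管致锦 《电子科技大学学报》 PKU Core 2025, Issue 5, pp. 789-800 (12 pages)
Models that combine traditional quantum neural networks with the self-attention mechanism consume substantial qubit resources. To address their low running efficiency and high design complexity on current NISQ devices, a hybrid encoding scheme is proposed that embeds dataset features into quantum states in a specific way, effectively combining angle encoding with amplitude encoding. Based on this encoding method, a uniquely structured double-ring Ansatz is designed; borrowing the divide-and-conquer idea of the self-attention mechanism, it is used to build a quantum neural network with greater expressive power. On the iris classification task, the training loss converges to 0, demonstrating that the model effectively captures the intrinsic relationships among the iris features. On the text classification task, classification accuracy improves by 8.9% on average compared with existing methods, and the number of training parameters is reduced while good results are maintained. The lightweight, low-complexity nature of the quantum neural network based on angle-amplitude hybrid encoding makes it better suited to current NISQ devices.
Keywords: quantum neural network; hybrid encoding; self-attention mechanism; text classification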
Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization (Cited by: 3)
19
Authors: Min DU Xing-shu CHEN 《Journal of Zhejiang University-Science C (Computers and Electronics)》 SCIE EI 2013, Issue 6, pp. 407-416 (10 pages)
Text categorization is a significant technique for managing the surging text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: usually, near points in space are also near when they are projected into a direction, which means that distant points in the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.
Keywords: k-nearest neighbors (kNN) text categorization accelerating strategy principal component analysis (PCA)
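The projection principle in the abstract above can be made concrete with a short Python sketch. For a unit direction, the gap between two projections never exceeds the Euclidean distance between the points, so candidates whose projection gap already exceeds the current k-th best distance can be skipped. This sketch is an exact-pruning variant written for illustration and may differ from the paper's strategy, which reports a small loss of accuracy. The names `principal_direction` and `knn_pruned` are hypothetical, and the principal component is approximated by power iteration rather than a full PCA.

```python
import math
from bisect import bisect_left


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def principal_direction(points, iterations=100):
    """Approximate the first principal component by power iteration."""
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    v = [1.0 / math.sqrt(dim)] * dim  # unit-length starting vector
    for _ in range(iterations):
        proj = [dot(row, v) for row in centered]
        w = [sum(t * row[j] for t, row in zip(proj, centered)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def knn_pruned(points, query, k, direction):
    """Return (indices of the k nearest points, number of distances computed).

    Assumption: `direction` has unit length, so a projection gap is a lower
    bound on the true distance.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    order = sorted((dot(p, direction), i) for i, p in enumerate(points))
    keys = [t for t, _ in order]
    q = dot(query, direction)
    right = bisect_left(keys, q)
    left = right - 1
    best = []      # (distance, index) pairs, sorted, at most k entries
    examined = 0
    while left >= 0 or right < len(order):
        # Visit candidates in order of increasing projection gap.
        gap_left = q - keys[left] if left >= 0 else math.inf
        gap_right = keys[right] - q if right < len(order) else math.inf
        if gap_left <= gap_right:
            gap, index = gap_left, order[left][1]
            left -= 1
        else:
            gap, index = gap_right, order[right][1]
            right += 1
        if len(best) == k and gap > best[-1][0]:
            break  # every remaining point is farther than the k-th best
        examined += 1
        best.append((math.dist(points[index], query), index))
        best.sort()
        del best[k:]
    return [i for _, i in best], examined


# Toy data: a 10 x 3 grid that is elongated along the first axis.
points = [(float(x), float(y)) for x in range(10) for y in range(3)]
query = (0.2, 1.1)
neighbors, examined = knn_pruned(points, query, 3, principal_direction(points))
print(neighbors, examined, len(points))
```

On this toy grid the search stops after visiting only the columns nearest the query, so far fewer than all 30 distances are computed, while the result matches a brute-force search.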
Non-Independent Term Selection for Chinese Text Categorization (Cited by: 2)
20
Authors: 李景阳 孙茂松 《Tsinghua Science and Technology》 SCIE EI CAS 2009, Issue 1, pp. 113-120 (8 pages)
Chinese text categorization differs from English text categorization due to its much larger term set (of words or character n-grams), which results in very slow training and operation of modern high-performance classifiers. This study assumes that this high-dimensionality problem is related to the redundancy in the term set, which cannot be solved by traditional term selection methods. A greedy algorithm framework named "non-independent term selection" is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between the performance and the size of the term set.
Keywords: Chinese text categorization term selection dimensionality