Given the continual growth of data in large databases such as wire transfer databases, incremental clustering algorithms play an increasingly important role in data mining (DM). However, few traditional clustering algorithms can both handle categorical data and explain their output clearly. Based on the idea of dynamic clustering, this paper proposes an incremental conceptive clustering algorithm, which introduces the Semantic Core Tree (SCT) to handle large volumes of categorical wire transfer data for money laundering detection. In addition, a rule generation algorithm is presented to express the clustering results in the form of knowledge. Applying this idea to financial data mining improves the efficiency of searching for the characteristics of money laundering data.
This paper proposes two models. The concept association model captures the co-occurrence relationships among keywords in documents, and the hierarchical Hamming clustering model reduces the dimensionality of the category feature vector space, addressing the extremely high dimensionality of the documents' feature space. Experimental results indicate that the concept association model obtains the co-occurrence relations among keywords, which effectively improves the recall of the classification system. The hierarchical Hamming clustering model reduces the dimensionality of the category feature vector efficiently: the resulting vector space is only about 10% of the original dimensionality.
Key words: text classification; concept association; hierarchical clustering; Hamming clustering
CLC number: TN 915.08
Foundation item: Supported by the National 863 Project of China (2001AA142160, 2002AA145090)
Biography: Su Gui-yang (1974-), male, Ph.D. candidate; research direction: information filtering and text classification.
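The abstract above does not spell out how the hierarchical Hamming clustering model merges features, so the following is only a minimal sketch of the general idea, assuming single-linkage agglomerative clustering over binary keyword occurrence vectors. The names `hamming`, `hamming_cluster`, `max_distance`, and the toy keyword vectors are all hypothetical and are not taken from the paper.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_cluster(vectors, max_distance):
    """Single-linkage agglomerative clustering under Hamming distance.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `max_distance` (an assumed stopping rule).
    """
    clusters = [[name] for name in vectors]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(hamming(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: each keyword's presence (1) or absence (0)
# in four documents.
keyword_vectors = {
    "goal": (1, 1, 0, 0),
    "match": (1, 1, 0, 0),
    "league": (1, 1, 1, 0),
    "stock": (0, 0, 1, 1),
    "market": (0, 0, 0, 1),
}

merged = hamming_cluster(keyword_vectors, max_distance=1)
print(merged)
```

In this toy case the five keyword features collapse into two merged features, which illustrates how clustering similar features can shrink the feature space. The roughly 10% figure reported in the abstract comes from the authors' experiments, not from this sketch.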
Purpose: Formal concept analysis (FCA) and concept lattice theory (CLT) are introduced to construct a network of interdisciplinary research (IDR) topics and to evaluate their effectiveness for knowledge structure exploration.
Design/methodology/approach: We introduced the theory and applications of FCA and CLT, and then proposed a method for interdisciplinary knowledge discovery based on CLT. As an empirical example, IDR topics in Information & Library Science (LIS) and Medical Informatics, and in LIS and Geography-Physical, were used as empirical fields. Subsequently, we carried out a comparative analysis with two other IDR topic recognition methods.
Findings: The CLT approach is suitable for IDR topic identification and prediction.
Research limitations: IDR topic recognition based on the CLT is not sensitive to the interdisciplinarity of topic terms, since the data can only reflect whether there is a relationship between the discipline and the topic terms. Moreover, the CLT cannot clearly represent a large number of concepts.
Practical implications: A deeper understanding of the IDR topics was obtained as the structural and hierarchical relationships between them were identified, which can help achieve more precise identification and prediction of IDR topics.
Originality/value: IDR topic identification based on CLT performed well, and the theory has several advantages for identifying and predicting IDR topics. First, in a concept lattice, there is a partial order relation between interconnected nodes; consequently, a complete concept lattice can present hierarchical properties. Second, clustering analysis of IDR topics based on concept lattices can yield clusters that highlight the essential knowledge features and help display the semantic relationships between different IDR topics. Furthermore, the Hasse diagram automatically displays all the IDR topics associated with the different disciplines, thus forming clusters of specific concepts and visually retaining and presenting the associations of IDR topics through multiple inheritance relationships between the concepts.
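To make the notion of a concept lattice in the abstract above concrete, here is a minimal sketch of how formal concepts can be enumerated from a small binary context. It uses the standard FCA definition (a concept is a pair of an object set and an attribute set that determine each other) with a brute-force enumeration suitable only for tiny examples. The topic-to-discipline assignments in `context` and the function names are hypothetical and are not data or code from the paper.

```python
from itertools import combinations


def derive_attributes(objects, context, all_attributes):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def derive_objects(attributes, context):
    """Objects that possess every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects, so only for illustration.
    """
    all_attributes = set().union(*context.values())
    objects = list(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = derive_attributes(subset, context, all_attributes)
            extent = derive_objects(intent, context)
            concepts.add((extent, intent))
    return concepts


def is_subconcept(lower, upper):
    """Partial order of the lattice: lower <= upper iff its extent is contained."""
    return lower[0] <= upper[0]


# Hypothetical context: topic terms (objects) and the disciplines
# (attributes) they are related to.
context = {
    "text mining": {"LIS", "Medical Informatics"},
    "GIS": {"LIS", "Geography-Physical"},
    "information retrieval": {"LIS"},
}

concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

The partial order checked by `is_subconcept` is what a Hasse diagram draws. Here the concept whose intent is only `LIS` sits above the two concepts that add a second discipline, which mirrors the hierarchical and multiple inheritance relationships the abstract describes.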
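Data stream classification studies dynamic model updating in open environments, aiming to detect and adapt to concept evolution in data streams that arrive in real time and change continuously. Most existing data stream classification methods assume that the number of classes in the stream is fixed and that sample labels can be obtained without restriction, which is unrealistic in real-world scenarios. To address this, the paper proposes an Active Learning Method for Concept Evolution Data Stream (ALM-CEDS). An importance measure for base classifiers based on the sample standard deviation is defined, and a sample prediction method based on weighted prediction probabilities is proposed to improve classification performance. A classifier update method based on a hybrid label query strategy is proposed, which updates the classifier with samples that are hard to distinguish and samples that represent the current data distribution. A novel class detection method based on the q-nearest-neighbor silhouette coefficient of micro-clusters is proposed to quickly identify new classes in the data stream. Comparative experiments on 4 real data streams and 5 synthetic data streams show that ALM-CEDS outperforms 6 existing data stream learning methods in classification performance.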
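Traditional data stream clustering methods lack online dimensionality reduction for high-dimensional data, which limits their clustering performance. To address this problem, a data stream clustering method based on scalable subspace learning is proposed (Scalable Subspace Learning for Clustering Data Streams, S2LCStream). First, scalable subspace learning establishes a projection relationship between historical data and newly arrived data, and the new data are projected onto the subspace spanned by the historical data to obtain their cluster assignments in real time. Second, to maintain the accuracy of the cluster assignments at different times, the consistency of the data distribution is checked on the continuously arriving stream to capture concept drift, and a backtracking mechanism is used to adjust the cluster assignments to the dynamically changing data distribution. Finally, tests on multiple real datasets verify the effectiveness of the proposed method on high-dimensional data streams. The proposed method handles concept drift in data streams efficiently while maintaining high clustering performance.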
Funding (incremental conceptive clustering paper): Supported by the National Natural Science Foundation of China (60403027), the Natural Science Foundation of Hubei Province (2005ABA258), and the Opening Foundation of the State Key Laboratory of Software Engineering (SKLSE05-07).
Funding (FCA/CLT interdisciplinary research topics paper): An outcome of the project "Study on the Recognition Method of Innovative Evolving Trajectory based on Topic Correlation Analysis of Science and Technology" (No. 71704170), supported by the National Natural Science Foundation of China; the project "Study on Regularity and Dynamics of Knowledge Diffusion among Scientific Disciplines" (No. 71704063), supported by the National Natural Science Foundation of China; and the Youth Innovation Promotion Association, CAS (Grant No. 2016159).