Abstract: This paper proposes an improved algorithm, named STC-I, for clustering Chinese Web pages. Based on the characteristics of the Chinese language, it adopts a new unit choice principle and a novel suffix tree construction policy. The experimental results show that the new algorithm retains the advantages of STC and outperforms STC in both precision and speed when clustering Chinese Web pages.
Key words: clustering; suffix tree; Web mining
CLC number: TP 311
Foundation item: Supported by the National Information Industry Development Foundation of China
Biography: YANG Jian-wu (1973-), male, Ph.D., research direction: information retrieval and text mining.
Funding: Supported by the National Natural Science Foundation of China (60502032, 60672068).
Abstract: Suffix trees are a key data structure for text string matching and are used in a wide range of application areas such as bioinformatics and data compression. Ukkonen's algorithm is examined in depth, and a new algorithm is proposed that reduces the number of memory operations during construction and keeps the resulting tree sequential. The experimental results show that both construction and matching are more efficient than with Ukkonen's algorithm.
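To make the matching role of a suffix tree concrete, the following is a minimal Python sketch of an uncompressed suffix trie. It is neither Ukkonen's algorithm nor the algorithm proposed in the abstract above: construction here is the naive quadratic method, and the names `Node`, `build_suffix_trie`, and `find_all` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Outgoing edges, one character per edge (an uncompressed trie).
    children: dict[str, "Node"] = field(default_factory=dict)
    # Start positions of every suffix whose path passes through this node.
    starts: list[int] = field(default_factory=list)


def build_suffix_trie(text: str) -> Node:
    """Insert every suffix of `text` one character at a time (O(n^2) work)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find_all(root: Node, pattern: str) -> list[int]:
    """Return the start positions of `pattern` in the indexed text.

    Matching walks one edge per pattern character, so its cost depends on
    the pattern length rather than on the text length. An empty pattern
    returns an empty list in this sketch.
    """
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return list(node.starts)


root = build_suffix_trie("banana")
print(find_all(root, "ana"))  # [1, 3]
```

A real suffix tree compresses chains of single-child nodes into labelled edges, which is what brings space down to linear in the text length; the sketch omits that step for brevity.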
Abstract: Classical algorithms and data structures assume that the underlying memory is reliable and that data remain safe during and after processing. This assumption is perilous, as several studies have shown that large, inexpensive memories are vulnerable to bit flips. The correctness of a classical algorithm's output can therefore be threatened by even a few memory faults. Fault tolerant data structures and resilient algorithms are designed to tolerate a limited number of faults and to provide correct output based on the uncorrupted part of the data. The suffix tree is an important data structure with widespread applications, including substring search, the superstring problem, and data compression. The fault tolerant version of the suffix tree presented in the literature uses complex techniques: encodable and decodable error-correcting codes, blocked data structures, and fault-resistant tries. In this work, we use the natural approach of data replication to develop a fault tolerant suffix tree based on the faulty memory random access machine model. The proposed data structure stores copies of the indices to withstand memory faults injected by an adversary. We develop a resilient version of Ukkonen's algorithm for constructing the fault tolerant suffix tree and derive an upper bound on the number of corrupt suffixes.
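The replication idea can be illustrated with a small Python sketch. It assumes a common convention for the faulty-memory model, namely that at most `delta` memory words can be corrupted, so that `2 * delta + 1` copies of an index let a majority vote recover the original value. The abstract does not state how many copies the proposed structure keeps or how it reads them back, so the class `ResilientIndex` and its parameters are hypothetical.

```python
from collections import Counter


class ResilientIndex:
    """An integer index stored as several copies and read by majority vote.

    Assumption (not taken from the abstract): at most `delta` copies can be
    corrupted, so 2 * delta + 1 copies always leave a correct majority.
    """

    def __init__(self, value: int, delta: int) -> None:
        self.delta = delta
        self.copies = [value] * (2 * delta + 1)

    def write(self, value: int) -> None:
        # Overwrite every copy so later faults start from a clean state.
        self.copies = [value] * (2 * self.delta + 1)

    def read(self) -> int:
        # The most frequent value is the original whenever at most
        # `delta` copies have been altered.
        value, _count = Counter(self.copies).most_common(1)[0]
        return value


# Simulate an adversary flipping two of the five copies.
idx = ResilientIndex(value=7, delta=2)
idx.copies[0] = 99
idx.copies[3] = -1
print(idx.read())  # 7
```

The cost of this approach is a factor of `2 * delta + 1` in space for each replicated index, which is the usual trade-off of replication against the more complex coding techniques the abstract mentions.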
Abstract: Hu Shuhe obtained a sufficient condition for the law of the iterated logarithm for sums of φ-mixing sequences with duple suffixes. This paper substantially improves that condition.
Abstract: In research on the Chinese temporal system, Chen (1988) proposed a ternary structure for the system. Building on this ternary structure, many studies have focused on the Chinese aspect system. Compared with the research on aspect, there are fewer studies of Chinese verbal situations, such as Ma (1981), Deng (1985), and Dai (1997), all of which are based on Vendler's (1967) four categories of verbal situations. Studies of phase are fewer still. Most researchers regard phase and verbal situation as the same concept. This article argues, however, that the study of the Chinese temporal system should first distinguish phase from verbal situation and only then compare them with aspect. Based on this distinction, the article considers the situation together with the verbal aspect suffix “LE”, which is also an aspect marker, and attempts to summarize the relationship between the situation and “LE”.
Abstract: In higher vocational colleges, most students have great difficulty with English suffixes. This paper analyses the differences between English and Chinese morphology, especially suffixes that denote tense, number, and comparison, and offers recommendations for the English as a second language (ESL) classroom.
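Abstract: Full-text index techniques, as one of the effective ways to improve the time and space efficiency of full-text retrieval, have been studied widely and in depth in recent years. According to how they are implemented, full-text indexes are divided into three categories: index techniques, hybrid compression-and-index techniques, and self-index techniques. From this classification perspective, the paper surveys representative methods and techniques for the time and space efficiency of full-text indexing: inverted files, signature files, suffix trees and suffix arrays, compression techniques based on these three kinds of index, self-indexes based on inverted files, and self-indexes based on suffix arrays. For each it covers the basic principles, the problems faced, and the progress made. The paper analyzes and compares the time and space performance of these techniques in detail and discusses the environments each one suits, along with its strengths and weaknesses. Finally, it summarizes the characteristics of these techniques and points out open problems and directions for future research.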
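Two of the index families named in this survey, inverted files and suffix arrays, can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (whitespace tokenization, naive suffix sorting, no compression or self-indexing), and the function names are hypothetical rather than taken from the surveyed work.

```python
from collections import defaultdict


def build_inverted_file(docs: list[str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document ids that contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in set(doc.lower().split()):
            index[term].append(doc_id)
    return dict(index)


def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_array_search(text: str, sa: list[int], pattern: str) -> list[int]:
    """Find all occurrences of `pattern` with two binary searches over `sa`."""
    m = len(pattern)
    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


docs = ["suffix tree index", "inverted file index"]
inverted = build_inverted_file(docs)
print(inverted["index"])   # [0, 1]

text = "banana"
sa = build_suffix_array(text)
print(sa)                                     # [5, 3, 1, 0, 4, 2]
print(suffix_array_search(text, sa, "ana"))   # [1, 3]
```

The inverted file answers term queries over a document collection, while the suffix array answers arbitrary substring queries over a single text, which is one reason the survey treats them as distinct index families.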