Abstract: This paper proposes STC-I, an improved algorithm for clustering Chinese Web pages. It exploits characteristics of the Chinese language by adopting a new unit choice principle and a novel suffix tree construction policy. Experimental results show that the new algorithm retains the advantages of STC and outperforms STC in both precision and speed when clustering Chinese Web pages. Key words: clustering; suffix tree; Web mining. CLC number: TP 311. Foundation item: Supported by the National Information Industry Development Foundation of China. Biography: YANG Jian-wu (1973-), male, Ph.D., research direction: information retrieval and text mining.
Abstract: This paper analyzes the theory of incremental learning for SVM (support vector machine) and points out a shortcoming of current research on SVM incremental learning: it considers only support vector optimization. Given the significance of keywords in training, a new incremental training method that includes keyword adjusting is proposed; the keyword adjusting eliminates the difference between incremental learning and batch learning. Experimental results show that the improved method outperforms the method without keyword adjusting and achieves the same precision as the batch method. Key words: SVM (support vector machine); incremental training; classification; keyword adjusting. CLC number: TP 18. Foundation item: Supported by the National Information Industry Development Foundation of China. Biography: SUN Jin-wen (1972-), male, Post-Doctoral, research direction: artificial intelligence, data mining and system integration.
Funding: Supported by the National High Tech Research and Development Plan of China (2001AA114141).
Abstract: We present HermitFS, a secure storage system that defends against many types of attacks. HermitFS uses strong cryptographic algorithms and a secure protocol to protect data from the time it is written to the time an authorized user accesses it. Our experimental results and security analysis show that HermitFS can protect information from unauthorized access in any open environment with little data overhead and acceptable performance.
Abstract: Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list; it is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. This paper compares the quantitative performance of these models experimentally. Because the Cosine measure cannot reflect structural similarity between documents, the paper proposes a new retrieval model based on TextTiling that takes the subtopic structure of documents into account. The model first splits the documents into text segments with TextTiling and calculates the similarity of each pair of text segments across the documents. It then obtains the overall similarity between the documents by combining the segment-pair similarities with an optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed TextTiling-based model is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are validated as appropriate.
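The segment-then-match idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the segmentation parameters, the term weighting, or the exact formula for combining matched pairs. The sketch therefore assumes that blank-line-separated paragraphs stand in for TextTiling segments, that segments are compared by cosine similarity over raw term frequencies, and that the matched total is divided by the larger segment count. All function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import sqrt


def cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def segments(doc: str) -> list[str]:
    """Assumed stand-in for TextTiling: one segment per blank-line-separated paragraph."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def structural_similarity(doc_a: str, doc_b: str) -> float:
    """Best one-to-one segment matching, divided by the larger segment count.

    The normalization is an assumption; the abstract does not specify it.
    """
    sa, sb = segments(doc_a), segments(doc_b)
    if not sa or not sb:
        return 0.0
    if len(sa) > len(sb):
        sa, sb = sb, sa  # make sa the shorter list
    sim = [[cosine(x, y) for y in sb] for x in sa]
    # Brute-force search over all assignments; only practical for a few segments.
    best = max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(sb)), len(sa))
    )
    return best / len(sb)
```

The brute-force search keeps the sketch dependency-free. For documents with more than a handful of segments, a polynomial-time assignment solver such as the Hungarian algorithm would be the natural replacement.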
Abstract: WITS-Math is a mathematical equation formatting tool in WITS, a multilingual document preparation environment. WITS-Math includes a library manager and an equation formatter. Its main task is to format a wide variety of mathematical equations and organize them into an equation library used by other tools in the WITS environment. WITS-Math is a direct-manipulation mathematics editor. It uses a syntax-directed markup language as its internal representation and provides an interactive WYSIWYG interface for users to format equations. WITS-Math also provides an equation access mechanism: other tools can access equations in a library by cross-reference from a source file or through data exchange, without knowing the structure of equation libraries. The common data structure and the rendering object in the WITS platform ensure that the formatted equations can be used directly by other tools.