To reduce the work involved in constructing a domain ontology, starting from an existing terminologically rich thesaurus is better than starting from scratch. Through a case study of reengineering the Defense Science and Technology Thesaurus into a prototype military aircraft ontology, a four-phase thesaurus-based methodology is introduced and investigated; it consists of identifying the application purpose, overall design, detailed design, and evaluation. Detailed design is the core step: it converts the terms and semantic relationships of the thesaurus into an ontology and supplements them with richer semantic relationships. The resulting prototype ontology includes 87 concepts and 34 relationships and can be extended and scaled up to a full-fledged domain ontology in the future. Eight universal types of relationships in this ontology are preliminarily summarized and analyzed, including equivalent, approximate, generic/abstract, part/whole, cause/effect, and entity/location relationships. Normalizing the semantic relationships is critical to the later merging and reuse of multiple ontologies.
There are deep-rooted traditions of research on youth problems in Russia. In their trends and purposes they partly concur with the traditions of the humanities in Europe and America. In Russia at different times (as was also the case in the West), diverse concepts of youth have conveyed, and continue to express, society's expectations for new generations. This is, in a sense, a theoretical mirror of the natural process of generational change. Under modern conditions these concepts can be reduced to three directions: youth as "no man's land", youth as a social danger, and youth as the hope of society. At the same time, youth theories bear the mark of their socio-cultural contexts and of the development of the humanities in Russia. This article examines these similarities and distinctions.
Thesaurus retrieval is fundamental in Chinese information processing. After a brief review of current techniques, this paper analyzes in depth the design of hash functions for a Chinese thesaurus in which collisions are resolved by chaining (the chain-address method). Several criteria, together with their theoretical expected values, are proposed for evaluating different hash functions. According to these values, some experimental hash functions are proposed that showed high efficiency in our tests.
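As a rough illustration of the setting this abstract describes, the sketch below stores thesaurus entries in a hash table with chained buckets and compares two hash functions by average successful search length. The abstract does not state which criteria or hash functions the paper uses, so `hash_codepoints`, `hash_first_char`, and `average_search_length` are hypothetical stand-ins chosen only for illustration.

```python
from collections import defaultdict


def hash_codepoints(word: str, buckets: int) -> int:
    """Hypothetical hash: polynomial over all Unicode code points of the word."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % buckets
    return h


def hash_first_char(word: str, buckets: int) -> int:
    """Hypothetical weak hash: uses only the first character."""
    return ord(word[0]) % buckets


def build_chains(words, hash_fn, buckets):
    """Insert every word into its bucket; colliding words share one chain."""
    table = defaultdict(list)
    for w in words:
        table[hash_fn(w, buckets)].append(w)
    return table


def lookup(table, hash_fn, buckets, word):
    """Search only the chain of the bucket the word hashes to."""
    return word in table.get(hash_fn(word, buckets), [])


def average_search_length(table, n_words):
    """Assumed criterion: mean comparisons for a successful lookup.

    The k-th word in a chain costs k comparisons, so a chain of
    length m contributes m * (m + 1) / 2 comparisons in total.
    """
    total = sum(len(c) * (len(c) + 1) // 2 for c in table.values())
    return total / n_words


if __name__ == "__main__":
    words = ["中国", "中文", "中心", "信息", "处理", "词典"]
    for fn in (hash_codepoints, hash_first_char):
        table = build_chains(words, fn, buckets=7)
        print(fn.__name__, average_search_length(table, len(words)))
```

A hash that looks only at the first character sends every word that starts with the same character into one chain, which is the kind of behavior an evaluation criterion such as average search length is meant to expose.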
Based on statistics for the 130 thesauri published in China so far, this article analyzes the development of thesauri in China in terms of origin, publication year, academic discipline, and number of entries, and redefines the development stages. In addition, drawing on 1,000 relevant research papers, the article analyzes the theoretical studies in terms of the number of papers and their research subjects, in order to give a clear picture of the development and features of thesaurus research in China.
To address the underutilization of Chinese research materials in nonferrous metals, a method for constructing a domain knowledge graph for nonferrous metals (DNMKG) was established. Starting from a domain thesaurus, entities and relationships were mapped to Resource Description Framework (RDF) triples to form the graph's framework. Properties and related entities were extracted from open knowledge bases to enrich the graph. A large-scale, multi-source heterogeneous corpus of over 1×10^9 words was compiled from recent literature to further expand DNMKG. Using the knowledge graph as prior knowledge, natural language processing techniques were applied to the corpus to generate word vectors. A novel entity evaluation algorithm was used to identify and extract real domain entities, which were added to DNMKG. A prototype system was developed to visualize the knowledge graph and support human-computer interaction. Results demonstrate that DNMKG can enhance knowledge discovery and improve research efficiency in the nonferrous metals field.
Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of low accuracy in the classification of short texts using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers have failed to notice that ignoring the semantic importance of certain feature terms might also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to feature terms in the SFT, classification accuracy can be improved. In particular, our method appeared to be more effective for finer-grained classification. Experiments on two short-text datasets demonstrate that our approach improves on state-of-the-art methods, including support vector machine (SVM) and multinomial Naive Bayes.
We present and analyze an unsupervised method for Word Sense Disambiguation (WSD). Our work is based on the method presented by McCarthy et al. in 2004 for finding the predominant sense of each word in an entire corpus. Their maximization algorithm allows weighted terms (similar words) from a distributional thesaurus to accumulate a score for each sense of an ambiguous word; that is, the sense with the highest score is chosen based on votes from a weighted list of terms related to the ambiguous word. This list is obtained using the distributional similarity method proposed by Lin Dekang for building a thesaurus. In the method of McCarthy et al., every occurrence of the ambiguous word uses the same thesaurus, regardless of the context in which the word occurs. Our method accounts for the context of a word when determining its sense by building the list of distributionally similar words from the syntactic context of the ambiguous word. We obtain a top precision of 77.54%, versus 67.10% for the original method, tested on SemCor. We also analyze the effect of the number of weighted terms on the tasks of finding the Most Frequent Sense (MFS) and WSD, and experiment with several corpora for building the Word Space Model.
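The voting scheme summarized in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: the sense labels, the neighbour lists, and the toy similarity table `TOY_SIM` are all invented for the example, and normalizing each neighbour's vote across senses is an assumption about how the weighted votes are combined.

```python
def choose_sense(senses, neighbours, sim):
    """Pick the sense with the highest accumulated weighted vote.

    senses     -- candidate sense labels for the ambiguous word
    neighbours -- (related_word, weight) pairs from a distributional thesaurus
    sim        -- function (sense, related_word) -> similarity score
    """
    scores = {s: 0.0 for s in senses}
    for word, weight in neighbours:
        total = sum(sim(s, word) for s in senses)
        if total == 0:
            continue  # this neighbour says nothing about any sense
        for s in senses:
            # Each neighbour splits its weight across senses in
            # proportion to how similar it is to each of them.
            scores[s] += weight * sim(s, word) / total
    best = max(scores, key=scores.get)
    return best, scores


# Toy similarity table (illustrative values only).
TOY_SIM = {
    ("bank/finance", "money"): 0.8, ("bank/river", "money"): 0.1,
    ("bank/finance", "loan"): 0.9, ("bank/river", "loan"): 0.0,
    ("bank/finance", "shore"): 0.1, ("bank/river", "shore"): 0.9,
    ("bank/finance", "water"): 0.05, ("bank/river", "water"): 0.8,
}


def toy_sim(sense, word):
    return TOY_SIM.get((sense, word), 0.0)


if __name__ == "__main__":
    senses = ["bank/finance", "bank/river"]
    # One corpus-wide neighbour list, shared by every occurrence of the word.
    corpus_wide = [("money", 0.9), ("loan", 0.7), ("shore", 0.4)]
    # A neighbour list built from the context of one specific occurrence.
    in_context = [("shore", 0.9), ("water", 0.8)]
    print(choose_sense(senses, corpus_wide, toy_sim)[0])
    print(choose_sense(senses, in_context, toy_sim)[0])
```

The two calls differ only in the neighbour list: a single corpus-wide list always yields the same sense, while a list built per occurrence lets the chosen sense change with context, which is the difference the abstract draws between the two methods.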
Purpose: We present an analytical, open source, and flexible natural language processing and text mining method for topic evolution, emerging topic detection, and research trend forecasting for all kinds of data-tagged text. Design/methodology/approach: We make full use of the functions provided by the open source VOSviewer and by Microsoft Office, including a thesaurus for data clean-up and a LOOKUP function for comparative analysis. Findings: Through application and verification in the domain of perovskite solar cell research, this method proves to be effective. Research limitations: A certain amount of manual data processing and background knowledge of the specific research domain are required for better, more illustrative analysis results. Adequate time for analysis is also necessary. Practical implications: We try to set up an easy, useful, and flexible interdisciplinary text analysis procedure for researchers, especially those without solid computer programming skills or who cannot easily access complex software. This procedure can also serve as a wonderful example for teaching information literacy. Originality/value: This text analysis approach has not been reported before.
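The academic research of Pan Wuyun, professor and doctoral supervisor at Shanghai Normal University, follows several main directions. The first is the history of Chinese and the historical comparison of East Asian languages, where he has achieved breakthrough results in many areas. In his book 《汉语历史音韵学》 (Historical Phonology of Chinese) he set out a complete system of Old Chinese phonology, which caused a sensation in discussions within China and soon drew a response internationally. Experts consider the book to represent the mainstream of Chinese historical phonology. "Thesaurus Linguae Sericae", the largest international research project on the history of Chinese in Europe, decided to adopt this system and twice invited Pan Wuyun to Europe to lecture on it.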
This paper describes a procedure, and reports on experiments performed to study its utility, that combines a structural property of a text's sentences with term expansion using WordNet [1] and a local thesaurus [2] to select the most appropriate extractive summary for a particular document. Sentences were tagged and normalized, then subjected to the Longest Common Subsequence (LCS) algorithm [3] [4] to select the most similar subset of sentences. Similarity was calculated from the LCS of pairs of sentences in the document. A normalized score was calculated and used to rank sentences. A selected top subset of the most similar sentences was then tokenized to produce a set of important keywords or terms. These terms were further expanded into two subsets using 1) WordNet and 2) a local electronic dictionary/thesaurus. The three sets obtained (the original and the two expanded sets) were then recycled to further refine and expand the list of sentences selected from the original document. The process was repeated a number of times in order to find the most representative set of sentences. A final set of the top (best) sentences was selected as candidate sentences for summarization. To verify the utility of the procedure, a number of experiments were conducted using an email corpus. The results were compared with those produced by human annotators as well as with results produced by a basic sentence similarity calculation method. The results were very encouraging and compared well with those of human annotators and with Jaccard sentence similarity.
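The LCS-based ranking step described in this abstract might look like the minimal sketch below. The abstract does not say how the score is normalized or how pairwise scores are combined, so dividing by the longer sentence's length and averaging over all other sentences are assumptions, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_similarity(a, b):
    """Assumed normalization: LCS length over the longer sentence's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def rank_sentences(sentences):
    """Return sentence indices, most similar to the rest of the document first."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    scores = []
    for i in range(n):
        others = [normalized_similarity(tokens[i], tokens[j])
                  for j in range(n) if j != i]
        scores.append(sum(others) / max(len(others), 1))
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    doc = [
        "the meeting is on monday",
        "the meeting is moved to monday",
        "lunch was great",
    ]
    for idx in rank_sentences(doc):
        print(idx, doc[idx])
```

The top-ranked sentences would then be tokenized into keywords for the expansion step; that step depends on WordNet and a local thesaurus and is not sketched here.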
The Longman Dictionary of the English Language (hereafter LDOEL) is another major work from the British publisher Longman, following the series of famous dictionaries it has brought out, such as Johnson's dictionary of English (A Dictionary of the English 1775), Roget's Thesaurus (1853), the Longman Modern English Dictionary (1968), and the Dictionary of Contemporary English (1978).
Funding: Supported by the National Natural Science Foundation of China (Grant No. 70573103).
Funding: Project (No. 20111081023) supported by the Tsinghua University Initiative Scientific Research Program, China.
Funding: Supported by the Mexican Government (SNI, SIP-IPN, COFAA-IPN, and PIFI-IPN), CONACYT, and the Japanese Government.