Speech and natural language content are major tools of communication. This research paper presents a natural-language-processing-based automated system for understanding speech language text. A new rule-based model is presented for analyzing natural language and extracting the relevant meanings from a given text. The user writes natural language text in simple English in a few paragraphs, and the designed system is able to analyze the given script. After composite analysis and extraction of the associated information, the system assigns particular meanings to an assortment of speech language text on the basis of its context. The system uses standard speech language rules that are clearly defined for speech languages such as English, Urdu, Chinese, Arabic, and French. The designed system provides a quick and reliable way to comprehend speech language context and generate the respective meanings.
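The paper above describes rule-based meaning extraction only at a high level. As a minimal sketch (not the authors' system), hand-written rules can map a simple English sentence to a structured "meaning" frame; the toy verb lexicon and the subject-verb-object rule below are invented for illustration:

```python
# Toy rule-based analyzer: a sentence is split on a known verb, and the
# tokens before/after it become the subject and object of a meaning frame.
VERBS = {"opens", "reads", "writes", "closes"}  # assumed toy lexicon

def extract_meaning(sentence):
    """Apply a single subject-verb-object rule; return None if no rule fires."""
    tokens = sentence.strip(".").split()
    for i, tok in enumerate(tokens):
        if tok.lower() in VERBS:
            return {
                "subject": " ".join(tokens[:i]),
                "action": tok.lower(),
                "object": " ".join(tokens[i + 1:]),
            }
    return None  # no rule matched

print(extract_meaning("Alice reads the book"))
# -> {'subject': 'Alice', 'action': 'reads', 'object': 'the book'}
```

A real system of this kind would layer many such rules per language, which is what makes the approach portable across English, Urdu, Chinese, and so on.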
Knowledge is important for text-related applications. In this paper, we introduce Microsoft Concept Graph, a knowledge graph engine that provides concept tagging APIs to facilitate the understanding of human languages. Microsoft Concept Graph is built upon Probase, a universal probabilistic taxonomy consisting of instances and concepts mined from the Web. We start by introducing the construction of the knowledge graph through iterative semantic extraction and taxonomy construction procedures, which extract 2.7 million concepts from 1.68 billion Web pages. We then use conceptualization models to represent text in the concept space to empower text-related applications such as topic search, query recommendation, Web table understanding, and Ads relevance. Since its release in 2016, Microsoft Concept Graph has received more than 100,000 pageviews, 2 million API calls, and 3,000 registered downloads from 50,000 visitors across 64 countries.
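To make the idea of concept tagging concrete, here is a toy sketch in the spirit of Probase-style conceptualization; the taxonomy table and its probabilities are invented for illustration and are not Microsoft Concept Graph data:

```python
# Toy instance-to-concept table: P(concept | instance), invented values.
TAXONOMY = {
    "python": {"programming language": 0.7, "snake": 0.3},
    "microsoft": {"company": 0.9, "brand": 0.1},
}

def conceptualize(terms):
    """Tag each known term with its most probable concept."""
    tags = {}
    for term in terms:
        concepts = TAXONOMY.get(term.lower())
        if concepts:
            tags[term] = max(concepts, key=concepts.get)
    return tags

print(conceptualize(["Microsoft", "released", "Python"]))
# -> {'Microsoft': 'company', 'Python': 'programming language'}
```

Representing text by such concept tags rather than raw words is what lets downstream applications (topic search, query recommendation) reason in the concept space.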
The increasing prevalence of technology in society has an impact on young people's language use and development. Greeklish is the writing of Greek texts using the Latin alphabet instead of the Greek one, a practice known as Latinization that is also employed for many other non-Latin-alphabet languages. The primary aim of this research is to evaluate the effect of Greeklish on reading time. A sample of 732 young Greeks were asked about their habits when communicating with friends through e-mail and social media, and then participated in an experiment in which they were asked to read and understand two short texts, one written in Greek and the other in Greeklish. The findings show that nearly one third of the participants use Greeklish. The results of the experiment reveal that understanding is not affected by the alphabet used, but reading Greeklish is significantly more time-consuming than reading Greek, independently of the sex of the participants and their familiarity with Greeklish. The findings suggest that equipping social and communication media with Latinization-related software utilities, such as language identifiers and converters, may reduce reading time and thus facilitate written communication among users.
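A converter of the kind the authors suggest embedding in communication media can be sketched as a greedy longest-match transliteration. Real Greeklish is ambiguous (e.g. Latin "i" may stand for η, ι, or υ), so the mapping below is only one plausible, illustrative convention:

```python
# Greeklish-to-Greek mapping: two-letter digraphs first, then single letters.
GREEKLISH = {
    "th": "θ", "ch": "χ", "ps": "ψ", "ks": "ξ",
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε",
    "z": "ζ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ",
}

def to_greek(text):
    out, i = [], 0
    while i < len(text):
        # prefer two-letter digraphs (th, ch, ...) before single letters
        for span in (2, 1):
            chunk = text[i:i + span].lower()
            if chunk in GREEKLISH:
                out.append(GREEKLISH[chunk])
                i += span
                break
        else:
            out.append(text[i])  # pass through anything unmapped
            i += 1
    return "".join(out)

print(to_greek("kalimera"))  # -> "καλιμερα" (one plausible rendering)
```

A production converter would also need a language identifier to decide whether a message is Greeklish at all, plus context to resolve the ambiguous vowels.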
With the explosion of online communication and publication, texts have become obtainable via forums, chat messages, blogs, book reviews, and movie reviews. These texts are usually short and noisy, without sufficient statistical signal or enough information for good semantic analysis. Traditional natural language processing methods, such as Bag-of-Words (BoW) based probabilistic latent semantic models, fail to achieve high performance in the short-text setting. Recent research has focused on the correlations between words, i.e., term dependencies, which can help mine the latent semantics hidden in short texts and help people understand them. A long short-term memory (LSTM) network can capture term dependencies and remember information for long periods of time; LSTMs have been widely used and have obtained promising results on many problems of understanding the latent semantics of texts. At the same time, by analyzing texts we find that a number of keywords contribute greatly to their semantics. In this paper, we establish a keyword vocabulary and propose an LSTM-based model that is sensitive to the words in that vocabulary; hence, the keywords leverage the semantics of the full document. The proposed model is evaluated on a short-text sentiment analysis task on two datasets, IMDB and SemEval-2016. Experimental results demonstrate that our model outperforms the baseline LSTM by 1% to 2% in accuracy and offers significant performance gains over several non-recurrent neural network latent semantic models, especially when dealing with short texts. We also incorporate the idea into a variant of LSTM, the gated recurrent unit (GRU) model, and achieve good performance, which shows that our method is general enough to improve different deep learning models.
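The core idea (making a recurrent model more sensitive to a keyword vocabulary) can be illustrated with a deliberately tiny toy, not the paper's model: a scalar GRU-like state is updated token by token, and tokens from the keyword vocabulary get a larger update gate, so they dominate the final representation. The vocabulary, sentiment scores, and gate values below are all invented:

```python
import math

# Invented toy keyword vocabulary and per-keyword input signal.
KEYWORDS = {"excellent", "brilliant", "terrible", "boring"}
SENTIMENT = {"excellent": 1.0, "brilliant": 1.0, "terrible": -1.0, "boring": -1.0}

def encode(tokens, keyword_gate=0.9, default_gate=0.2):
    """Scalar GRU-like update; keywords get a larger gate than other tokens."""
    state = 0.0
    for tok in tokens:
        x = SENTIMENT.get(tok, 0.0)                       # toy input signal
        gate = keyword_gate if tok in KEYWORDS else default_gate
        state = (1 - gate) * state + gate * math.tanh(x)  # gated blend
    return state

print(encode("the film was brilliant".split()) > 0)  # prints True
print(encode("the plot was boring".split()) > 0)     # prints False
```

In the real model the state is a vector, the gates are learned, and the keyword sensitivity is trained end to end; the toy only shows why up-weighting keyword updates lets a few words steer the document representation.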
Long-document semantic measurement has great significance in many applications, such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts. Document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements, such as the research purpose, methodology, and domain, are included and enriched. As such, we can obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of the word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
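The profile-matching step can be sketched as follows, under heavy simplification: each paper's semantic profile (purpose, methodology, domain) is embedded with word vectors, averaged per field, and compared by cosine similarity, with the document score averaging the field scores. The two-dimensional word vectors and example papers below are invented for illustration:

```python
import math

# Invented toy word vectors standing in for trained embeddings.
VECS = {
    "classification": [1.0, 0.2], "categorization": [0.9, 0.3],
    "clustering": [0.2, 1.0], "svm": [0.8, 0.1], "kmeans": [0.1, 0.9],
    "text": [0.5, 0.5], "images": [0.4, 0.6],
}

def field_vec(words):
    """Average the vectors of the known words in one profile field."""
    dims = len(next(iter(VECS.values())))
    acc, used = [0.0] * dims, 0
    for w in words:
        v = VECS.get(w)
        if v:
            acc = [a + b for a, b in zip(acc, v)]
            used += 1
    return [a / used for a in acc] if used else acc

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def profile_similarity(p1, p2):
    """Average the per-field cosine similarities of two semantic profiles."""
    scores = [cosine(field_vec(p1[f]), field_vec(p2[f])) for f in p1]
    return sum(scores) / len(scores)

paper_a = {"purpose": ["classification"], "method": ["svm"], "domain": ["text"]}
paper_b = {"purpose": ["categorization"], "method": ["svm"], "domain": ["images"]}
paper_c = {"purpose": ["clustering"], "method": ["kmeans"], "domain": ["images"]}

# Papers sharing purpose and method score higher than unrelated ones.
print(profile_similarity(paper_a, paper_b) > profile_similarity(paper_a, paper_c))
```

The paper's actual contribution is in how the profiles are extracted and enriched and in the jointly trained embeddings; the sketch only shows the final distance computation between profiles.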
Funding: supported by the Foundation of the State Key Laboratory of Software Development Environment (No. SKLSDE-2015ZX-04).