This paper analyzes the global competitive landscape of smartphone technological innovation capacity using latent semantic indexing (LSI) and the vector space model (VSM). It integrates the theory of technological ecological niches and evaluates four key dimensions: patent quality, energy efficiency engineering, technological modules, and intelligent computing power. The findings reveal that the USA has established strong technological barriers through standard-essential patents (SEPs) in wireless communication and integrated circuits. In contrast, Chinese mainland firms rely heavily on fundamental technologies. Qualcomm Inc. in the USA and Taiwan Semiconductor Manufacturing Company (TSMC) in Chinese Taiwan have built comprehensive patent portfolios in energy efficiency engineering, while the Chinese mainland faces challenges in advancing dynamic frequency modulation algorithms and high-end manufacturing processes. Huawei Inc. in the Chinese mainland leads in 5G module technology but struggles with ecosystem collaboration: semiconductor manufacturing and radio frequency (RF) components still rely on external suppliers. This highlights the urgent need for innovation in new materials and open-source architectures. To enhance intelligent computing power, Chinese mainland firms must address coordination challenges; they should adopt scenario-driven technological strategies and build a comprehensive ecosystem that includes hardware, operating systems, and developer networks.
Funding: Supported in part by the National Social Science Foundation of China (No. 20BGL203).
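The LSI/VSM machinery this abstract builds on is standard, so a minimal sketch may help: documents become vectors in a latent space obtained by truncated SVD of a term-document matrix, and portfolios are compared by cosine similarity. The terms and counts below are toy assumptions, not the paper's patent data.

```python
# A minimal sketch of the generic LSI/VSM pipeline, not the authors'
# patent-analysis system. Terms and document counts are hypothetical.
import numpy as np

# Toy term-document matrix: rows = terms, columns = patent documents.
terms = ["5g", "rf", "soc", "battery", "antenna"]
A = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 1],
    [0, 2, 3, 0],
    [0, 1, 0, 4],
    [1, 0, 2, 1],
], dtype=float)

# LSI: truncated SVD keeps the k strongest latent "topics".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare two document portfolios in the latent space.
print(cosine(doc_vecs[0], doc_vecs[1]))
```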
This study introduces the Orbit Weighting Scheme (OWS), a novel approach aimed at enhancing the precision and efficiency of vector space information retrieval (IR) models, which have traditionally relied on weighting schemes like tf-idf and BM25. These conventional methods often struggle to accurately capture document relevance, leading to inefficiencies in both retrieval performance and index size management. OWS proposes a dynamic weighting mechanism that evaluates the significance of terms based on their orbital position within the vector space, emphasizing term relationships and distribution patterns overlooked by existing models. Our research focuses on evaluating OWS's impact on model accuracy using information retrieval metrics such as Recall, Precision, Interpolated Average Precision (IAP), and Mean Average Precision (MAP). Additionally, we assess OWS's effectiveness in reducing the inverted index size, which is crucial for model efficiency. We compare OWS-based retrieval models against others using different schemes, including tf-idf variations and BM25Delta. Results reveal OWS's superiority, achieving 54% Recall and 81% MAP, along with a notable 38% reduction in inverted index size. This highlights OWS's potential in optimizing retrieval processes and underscores the need for further research in this underrepresented area to fully leverage OWS's capabilities in information retrieval methodologies.
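The OWS weighting formula itself is not specified in the abstract, so it is not reproduced here; the evaluation metrics it reports are standard, though, and a small reference implementation of MAP clarifies what the 81% figure measures.

```python
# Reference implementation of Average Precision and MAP, the standard
# IR metrics the abstract reports. Rankings below are toy examples.

def average_precision(ranked, relevant):
    """AP for one query: mean precision at each relevant hit."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy check: two queries with hypothetical rankings.
runs = [(["d1", "d2", "d3"], {"d1", "d3"}),
        (["d4", "d5"], {"d5"})]
print(mean_average_precision(runs))  # (0.8333 + 0.5) / 2 = 0.6667
```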
One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing (NLP) in the last two decades has been the development of techniques for text representation that solve the so-called curse of dimensionality, a problem which plagues NLP in general given that the feature set for learning starts as a function of the size of the language in question, typically upwards of hundreds of thousands of terms. As such, much of the research and development in NLP in the last two decades has been devoted to finding and optimizing solutions to this problem, that is, to effective feature selection in NLP. This paper looks at the development of these various techniques, leveraging a variety of statistical methods that rest on linguistic theories advanced in the middle of the last century, namely the distributional hypothesis, which suggests that words found in similar contexts generally have similar meanings. In this survey paper we look at the development of some of the most popular of these techniques from a mathematical as well as a data structure perspective, from Latent Semantic Analysis to vector space models to their more modern variants, typically referred to as word embeddings. In this review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we explore the idea of semantic spaces more generally, beyond their applicability to NLP.
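The distributional hypothesis the survey rests on can be made concrete in a few lines: represent each word by the counts of its neighbors, and words that appear in similar contexts end up with similar vectors. This is the count-based idea behind LSA-style embeddings, not any one system reviewed in the paper; the toy corpus is an assumption.

```python
# Minimal illustration of the distributional hypothesis: words in
# similar contexts get similar co-occurrence vectors. Toy corpus only.
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric context window of size 1.
M = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            M[idx[w], idx[corpus[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# "cat" and "dog" share contexts ("the _ sat"), so their vectors align.
print(cosine(M[idx["cat"]], M[idx["dog"]]))
```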
We propose an algorithm for learning hierarchical user interest models from the Web pages users have browsed. In this algorithm, the interests of a user are represented as a tree, called a user interest tree, whose content and structure can change simultaneously to adapt to changes in the user's interests. This representation expresses a user's specific and general interests as a continuum. In some sense, specific interests correspond to short-term interests, while general interests correspond to long-term interests, so this representation more faithfully reflects users' interests. The algorithm can automatically model a user's multiple interest domains, dynamically generate the interest models, and prune a user interest tree when the number of its nodes exceeds a given value. Finally, we show experimental results from a Chinese Web site.
Funding: Supported by the National Natural Science Foundation of China (69973012, 60273080).
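A hedged sketch of the data structure may clarify the general-to-specific continuum: general interests sit near the root, specific ones at the leaves, and weak subtrees are pruned once the node count exceeds a budget. The field names, weights and pruning rule below are hypothetical; the paper's exact update algorithm is not reproduced.

```python
# Sketch of a user interest tree with node-budget pruning. Weights and
# the pruning threshold are assumed, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class InterestNode:
    topic: str
    weight: float = 0.0          # interest strength, e.g. decayed over time
    children: list = field(default_factory=list)

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)

def prune(node, min_weight):
    """Drop weak subtrees so the tree stays within its node budget."""
    node.children = [c for c in node.children if c.weight >= min_weight]
    for c in node.children:
        prune(c, min_weight)

root = InterestNode("interests", 1.0, [
    InterestNode("sports", 0.8, [InterestNode("tennis", 0.6)]),
    InterestNode("cooking", 0.1),   # weak interest, candidate for pruning
])
if count_nodes(root) > 3:           # hypothetical node budget
    prune(root, min_weight=0.2)
print([c.topic for c in root.children])  # ['sports']
```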
A hybrid model based on the combination of keywords and concepts is put forward. The hybrid model is built on the vector space model and a probabilistic reasoning network. It not only exploits the advantages of keyword retrieval and concept retrieval but also compensates for their shortcomings. Its parameters can be adjusted for different usages in order to achieve the best information retrieval results, as our experiments have demonstrated.
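The adjustable-parameter idea can be sketched as a tunable linear mix of the two retrieval scores. The probabilistic reasoning network is not specified in the abstract, so the concept score here is a hypothetical stand-in, and the blending rule is an assumption rather than the paper's formula.

```python
# Sketch of hybrid keyword/concept scoring with a tunable balance.
# concept_score and alpha values are hypothetical stand-ins.

def hybrid_score(keyword_score, concept_score, alpha=0.6):
    """alpha adjusts the keyword/concept balance per usage scenario."""
    return alpha * keyword_score + (1 - alpha) * concept_score

# Same document under two tunings: keyword-heavy vs concept-heavy.
print(hybrid_score(0.9, 0.4, alpha=0.8))  # 0.80
print(hybrid_score(0.9, 0.4, alpha=0.3))  # 0.55
```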
In this paper, an improved algorithm, the web-based keyword weight algorithm (WKWA), is presented to weight keywords in web documents. WKWA takes into account the representation features of web documents and the advantages of the TF*IDF, TFC and ITC algorithms in order to make it more appropriate for web documents. Meanwhile, the presented algorithm is applied to an improved vector space model (IVSM). A real system has been implemented for calculating semantic similarities of web documents. Four experiments have been carried out: keyword weight calculation, feature item selection, semantic similarity calculation, and WKWA time performance. The results demonstrate the accuracy of the keyword weights and show that semantic similarity calculation is improved.
Funding: Project supported by the Science Foundation of Shanghai Municipal Commission of Science and Technology (Grant No. 055115001).
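One way web-aware weighting of this kind is commonly realized is classic TF-IDF scaled by where a term appears in the page. The tag boosts below are assumed values for illustration; the paper's actual WKWA formula combines TF*IDF, TFC and ITC and is not reproduced here.

```python
# Hedged sketch of web-aware keyword weighting in the spirit of WKWA:
# TF-IDF scaled by term placement. TAG_BOOST values are hypothetical.
import math

TAG_BOOST = {"title": 3.0, "h1": 2.0, "body": 1.0}  # assumed values

def web_keyword_weight(tf, df, n_docs, tag="body"):
    idf = math.log(n_docs / (1 + df))
    return TAG_BOOST.get(tag, 1.0) * tf * idf

# Same term counts, different placement: title occurrences count more.
print(web_keyword_weight(tf=3, df=10, n_docs=1000, tag="title"))
print(web_keyword_weight(tf=3, df=10, n_docs=1000, tag="body"))
```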
Purpose - Natural languages have a fundamental quality of suppleness that makes it possible to present a single idea in plenty of different ways. This feature is often exploited in the academic world, leading to the theft of work referred to as plagiarism. Many approaches have been put forward to detect such cases based on various text features and grammatical structures of languages. However, there is considerable scope for improvement in detecting intelligent plagiarism. Design/methodology/approach - To realize this, the paper introduces a hybrid model that detects intelligent plagiarism by breaking the entire process into three stages: (1) clustering, (2) vector formulation in each cluster based on semantic roles, normalization and similarity index calculation, and (3) summary generation using an encoder-decoder. An effective weighing scheme has been introduced to select the terms used to build vectors, based on K-means calculated on the synonym set for each term. Only if the value calculated in the last stage lies above a predefined threshold is the next semantic argument analyzed. When the similarity score for two documents is beyond the threshold, a short summary of the plagiarized documents is created. Findings - Experimental results show that this method is able to detect the connotation and concealment used in idea plagiarism, besides detecting literal plagiarism. Originality/value - The proposed model can help academics stay updated by providing summaries of relevant articles. It would eliminate the practice of plagiarism infesting the academic community at an unprecedented pace. The model will also accelerate the process of reviewing academic documents, aiding the speedy publication of research articles.
Funding: Supported by the Technical Education Quality Program (TEQIP III), implemented by NPIU, a unit of MHRD, Government of India, for the implementation of World Bank assisted projects in technical education.
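The core of stage (2) can be sketched as similarity over synonym-normalized term vectors gated by a threshold, which is what lets the model catch connotation-level (idea) plagiarism that literal matching misses. The synonym table and threshold below are hypothetical; the paper derives its term weights from K-means over synonym sets.

```python
# Simplified sketch of synonym-normalized similarity with a threshold
# gate. SYNONYMS and the 0.8 threshold are hypothetical.
from collections import Counter
import math

SYNONYMS = {"idea": "concept", "notion": "concept", "car": "vehicle"}

def normalize(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]

def cosine_sim(a, b):
    va, vb = Counter(normalize(a)), Counter(normalize(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Paraphrased pair: "idea" vs "notion" map to the same synonym set.
doc_a = "the idea behind the method".split()
doc_b = "the notion behind the method".split()
score = cosine_sim(doc_a, doc_b)
if score > 0.8:  # hypothetical gate before summary generation
    print(f"flag for summary generation, similarity={score:.2f}")
```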
Problems with data security impede the widespread application of cloud computing. Although data can be protected through encryption, effective retrieval of encrypted data is difficult to achieve using traditional methods. This paper analyzes encrypted storage and retrieval technologies in cloud storage applications. A ranking method based on fully homomorphic encryption is proposed to meet the demands of encrypted storage. Results show that this method can improve efficiency.
Funding: Funded by the National Key Technology R&D Program of China under Grant No. 2008BAH37B07, the National Natural Science Foundation of China under Grant No. 60970148, and the National Basic Research Program of China ("973" Program) under Grant No. 2007CB310806.
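The homomorphic principle the ranking method relies on is that a server can compute on ciphertexts without ever decrypting them. As a lighter illustration, the sketch below uses the *partially* homomorphic Paillier scheme (additions only) via the third-party `phe` package; this is an assumption for demonstration, not the paper's fully homomorphic construction.

```python
# Illustration of homomorphic aggregation with Paillier (additive PHE),
# standing in for the paper's FHE-based ranking. Requires `pip install phe`.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Server holds encrypted per-term relevance scores, never plaintexts.
enc_scores = [public_key.encrypt(s) for s in (0.4, 0.3, 0.1)]

# Homomorphic addition: the server sums scores without decrypting.
enc_total = enc_scores[0] + enc_scores[1] + enc_scores[2]

# Only the data owner can decrypt the aggregated ranking score.
print(private_key.decrypt(enc_total))  # ~0.8
```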
This paper focuses on document clustering using a clustering algorithm based on a DEnsity Tree (CABDET) to improve clustering accuracy. The CABDET method constructs a density-based tree structure for every potential cluster by dynamically adjusting the radius of the neighborhood according to local density. It avoids the global density parameters of density-based spatial clustering of applications with noise (DBSCAN) and reduces the input parameters to one. Experimental results on real documents show that CABDET achieves better clustering accuracy than DBSCAN. The CABDET algorithm obtains a maximum F-measure of 0.347 with a root-node neighborhood radius of 0.80, which is higher than the 0.332 achieved by DBSCAN with a neighborhood radius of 0.65 and a minimum of 6 objects.
Funding: Science and Technology Development Project of Tianjin (No. 06FZRJGX02400) and the National Natural Science Foundation of China (No. 60603027).
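The contrast with DBSCAN is that the neighborhood radius adapts per point instead of being a single global eps. The adjustment rule below (tighter radius in denser regions, with an assumed inverse-square-root scaling) is hypothetical, since the abstract does not give CABDET's exact formula.

```python
# Rough sketch of an adaptive neighborhood radius: denser regions get a
# tighter radius. The 1/sqrt(density) scaling is an assumption.
import numpy as np

def local_density(points, i, radius):
    d = np.linalg.norm(points - points[i], axis=1)
    return int(np.sum(d <= radius)) - 1  # neighbors, excluding self

def adapted_radius(points, i, base_radius):
    density = local_density(points, i, base_radius)
    return base_radius / np.sqrt(max(density, 1))

pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5]])
print(adapted_radius(pts, 0, base_radius=0.8))  # shrunk: dense corner
print(adapted_radius(pts, 3, base_radius=0.8))  # unchanged: isolated point
```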
What is a real-time agent, how does it remedy ongoing daily frustrations for users, and how does it improve retrieval performance on the World Wide Web? These are the main questions we focus on in this manuscript. In many distributed information retrieval systems, information in agents should be ranked based on a combination of multiple criteria. Linear combination of ranks has been the dominant approach due to its simplicity and effectiveness. Such a combination scheme in a distributed infrastructure requires that the ranks in resources or agents be comparable to each other before being combined. The main challenge is transforming the raw rank values of different criteria appropriately to make them comparable before any combination; the different ways in which agents rank make this strategy difficult. In this research, we demonstrate how to rank Web documents based on resource-provided information and how to combine several resources' ranking schemas at one time. The proposed system was implemented specifically on data provided by agents to create a comparable combination of different attributes. The proposed approach was tested on queries provided by the Text Retrieval Conference (TREC). Experimental results showed that our approach is effective and robust compared with offline search platforms.
Funding: This research was developed at the University of Ottawa as part of the "SAMA" search engine.
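The core step the abstract describes, making raw scores from different agents comparable before a linear combination, can be sketched directly. Min-max normalization is one common transformation choice; the weights and scores below are hypothetical.

```python
# Sketch of normalize-then-combine rank fusion. Min-max is one common
# normalization choice; weights and raw scores are hypothetical.

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def combine(criteria, weights):
    """criteria: list of raw score lists, one per agent/criterion."""
    normed = [min_max(c) for c in criteria]
    return [sum(w * col[i] for w, col in zip(weights, normed))
            for i in range(len(criteria[0]))]

# Two agents score three documents on incomparable raw scales.
relevance = [120.0, 80.0, 100.0]   # e.g., BM25-like scores
freshness = [0.2, 0.9, 0.5]        # e.g., recency scores
print(combine([relevance, freshness], weights=[0.7, 0.3]))
```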
Data warehouse (DW) modeling is a complicated task, involving both knowledge of business processes and familiarity with the structure and behavior of operational information systems. Existing DW modeling techniques suffer from the following major drawbacks: the data-driven approach requires high levels of expertise and neglects the requirements of end users, while the demand-driven approach lacks enterprise-wide vision and disregards existing models of the underlying operational systems. To make up for those shortcomings, a method of classifying schema elements for DW modeling is proposed in this paper. We first put forward vector space models for subjects and schema elements, then present an adaptive approach with self-tuning theory to construct context vectors of subjects, and finally classify the source schema elements into the different subjects of the DW automatically. Benefiting from the result of this schema element classification, designers can model and construct a DW more easily.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60403041, the Project of the National "10th Five-Year Plan" of China under Grant No. 2001BA102A01, and the National Grand Fundamental Research 973 Program of China under Grant No. 1999032705. The authors thank Jian Liu, Bo Yu, Peng Zhang and Ling-Fu Li for their valuable suggestions, and the anonymous reviewers for their insightful comments.
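The final classification step can be sketched as nearest-subject assignment by cosine similarity between term vectors. The vocabulary and subject vectors below are toy assumptions, and the paper's self-tuning construction of subject context vectors is simplified away.

```python
# Hedged sketch of classifying schema elements into DW subjects via
# cosine similarity. Vocabulary and subject vectors are hypothetical.
import numpy as np

VOCAB = ["order", "customer", "price", "ship", "stock"]
subjects = {
    "sales":     np.array([1, 1, 1, 0, 0], dtype=float),
    "logistics": np.array([0, 0, 0, 1, 1], dtype=float),
}

def classify(element_vec):
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(subjects, key=lambda s: cos(subjects[s], element_vec))

# A schema element "order_price" mentions order and price terms.
print(classify(np.array([1, 0, 1, 0, 0], dtype=float)))  # sales
```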
This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. Character-based indexing methods, such as bi-gram or tri-gram indexing, have high false-drop rates due to mismatches between queries and documents. On the other hand, it is difficult to efficiently identify all the proper nouns, domain terminology, and phrases in word-based indexing systems. The new indexing method uses both the proximity and the mutual information of word pairs to represent the text content, so as to overcome the high false-drop, new-word, and phrase problems that exist in character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
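The two signals the scheme combines, proximity and mutual information of word pairs, can be illustrated with standard pointwise mutual information (PMI) computed over pairs that co-occur within a fixed window. The window size and toy corpus below are assumptions, not the paper's settings.

```python
# Sketch of proximity-constrained PMI for word pairs. Window size and
# corpus are toy assumptions.
import math
from collections import Counter
from itertools import combinations

docs = [["mobile", "phone", "patent"], ["mobile", "phone", "market"],
        ["patent", "office"]]

WINDOW = 2
word_cnt, pair_cnt, n = Counter(), Counter(), 0
for doc in docs:
    for w in doc:
        word_cnt[w] += 1
        n += 1
    for i, j in combinations(range(len(doc)), 2):
        if j - i <= WINDOW:                      # proximity constraint
            pair_cnt[tuple(sorted((doc[i], doc[j])))] += 1

def pmi(a, b):
    pair = pair_cnt[tuple(sorted((a, b)))]
    if not pair:
        return float("-inf")
    return math.log((pair / n) / ((word_cnt[a] / n) * (word_cnt[b] / n)))

# "mobile phone" co-occurs consistently, so it scores as a strong pair.
print(pmi("mobile", "phone"))  # log(4) ~ 1.386
```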