Funding: Supported by the National Natural Science Foundation of China (60473045), the Research Plan of Hebei Province (05213573), and the Research Plan of the Education Office of Hebei Province (2004406).
Abstract: This paper proposes a new approach to classifying Deep Web query interfaces. It extracts features from the text of the forms on the query interfaces, assisted by a synonym library, and classifies the interfaces with a radial basis function neural network (RBFNN). An RBFNN is an effective feed-forward artificial neural network that has a simple structure yet offers excellent nonlinear approximation, fast convergence, and global convergence. Our experiments use the TEL_8 query interface data set from the UIUC online database, which consists of 477 query interfaces in 8 typical domains. The experimental results show that the proposed approach classifies the query interfaces with an accuracy of 95.67%.
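The feature-extraction step described above can be illustrated with a minimal sketch. It assumes a simple bag-of-words representation; the synonym table, the vocabulary, and the function name `extract_features` are hypothetical and are not taken from the paper, and the RBFNN classifier itself is not shown.

```python
import re
from collections import Counter

# Hypothetical synonym library: maps surface terms to one canonical term.
SYNONYMS = {
    "writer": "author",
    "authors": "author",
    "departure": "from",
    "leaving": "from",
    "destination": "to",
    "arriving": "to",
}

def extract_features(form_text, vocabulary):
    """Return a term-frequency vector over `vocabulary` for one query interface."""
    # Lowercase the form text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", form_text.lower())
    # Collapse synonyms so that different labels for one concept share a dimension.
    canonical = [SYNONYMS.get(token, token) for token in tokens]
    counts = Counter(canonical)
    return [counts[term] for term in vocabulary]

vocabulary = ["author", "title", "from", "to"]
print(extract_features("Title: Writer: Authors:", vocabulary))
print(extract_features("Leaving from: Destination:", vocabulary))
```

Vectors of this kind could then be passed to any classifier; merging synonyms before counting keeps interfaces from the same domain close together even when their labels differ.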
Abstract: Web page classification is an important application in many areas of Internet information retrieval, such as directory classification and vertical search. Query-log-based methods, a lightweight form of Web page classification, avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes the logs difficult to use directly for constructing a classifier. To solve this problem, we explore the semantic relations among different queries through word embedding and propose three improved graph-structure classification algorithms. To reflect the semantic relevance between queries, we first map each user query into a low-dimensional space according to its query vector. Then we calculate the uniform resource locator (URL) vector according to the relationship between the query and the URL. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods achieve an F1-value about 20% higher than other Web page classification methods based on query logs.
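To make the graph idea concrete, the following is a minimal sketch of plain label propagation on a query-URL click graph. It is not the improved LPA or the bipartite graph expansion algorithm from the paper and uses no word embeddings; the function name `propagate_labels` and the example data are hypothetical.

```python
from collections import defaultdict

def propagate_labels(clicks, seed_labels, iterations=10):
    """Spread category labels from labeled URLs to unlabeled ones.

    clicks: list of (query, url) pairs taken from a click log.
    seed_labels: dict mapping some URLs to a known category.
    Returns a dict mapping every URL that received a score to a category.
    """
    urls_of = defaultdict(set)
    queries_of = defaultdict(set)
    for query, url in clicks:
        urls_of[query].add(url)
        queries_of[url].add(query)

    # Each URL holds a score per category; seeds start with full confidence.
    url_scores = {u: {c: 1.0} for u, c in seed_labels.items()}
    for _ in range(iterations):
        # Step 1: each query averages the scores of the URLs clicked for it.
        query_scores = {}
        for query, urls in urls_of.items():
            acc = defaultdict(float)
            for u in urls:
                for category, score in url_scores.get(u, {}).items():
                    acc[category] += score / len(urls)
            if acc:
                query_scores[query] = dict(acc)
        # Step 2: each unlabeled URL averages the scores of its queries.
        # Seed URLs are clamped to their known label on every round.
        new_scores = {u: {c: 1.0} for u, c in seed_labels.items()}
        for u, queries in queries_of.items():
            if u in seed_labels:
                continue
            acc = defaultdict(float)
            for query in queries:
                for category, score in query_scores.get(query, {}).items():
                    acc[category] += score / len(queries)
            if acc:
                new_scores[u] = dict(acc)
        url_scores = new_scores
    return {u: max(scores, key=scores.get) for u, scores in url_scores.items()}

clicks = [
    ("cheap flights", "air.example"),
    ("cheap flights", "fly.example"),
    ("python tutorial", "docs.example"),
    ("python tutorial", "learn.example"),
]
seeds = {"air.example": "travel", "docs.example": "computing"}
print(propagate_labels(clicks, seeds))
```

Because labels travel only along click edges, a URL that shares no query with any labeled URL stays unlabeled, which is one way to see why sparse click data is a problem for this family of methods.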
Funding: Supported by the Social Science Planning Foundation of Chongqing (Grant No. 2011QNCB28).
Abstract: Purpose: Existing research on predicting queries with news intent has extracted classification features from external knowledge bases. This paper shows how to apply features extracted from query logs to identify news queries automatically without using any external resources. Design/methodology/approach: First, we manually labeled 1,220 news queries from Sogou.com. Based on an analysis of these queries, we then identified three features of news queries in terms of query content, time of query occurrence, and user click behavior. Afterwards, we used 12 effective features proposed in the literature as a baseline and conducted experiments with a support vector machine (SVM) classifier. Finally, we compared the impact of the features used in this paper on the identification of news queries. Findings: Compared with the baseline features, the F-score improved from 0.6414 to 0.8368 with the three newly identified features, among which the burst point (bst) was the most effective in predicting news queries. In addition, query expression (qes) was more useful than query terms, and among the click-behavior-based features, news URL was the most effective. Research limitations: Analyses based on features extracted from query logs might produce limited results. The segmentation tool used in this study has been applied more widely to long texts than to short queries. Practical implications: The research will help general-purpose search engines address search intents for news events. Originality/value: Our approach provides a new perspective on recognizing queries with news intent without relying on large news corpora such as blogs or Twitter.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 71173164) and the National Key Technology R&D Program of the Ministry of Science and Technology of China (Grant No. 2012BAH33F03).
Abstract: Purpose: In this paper, we use query refinements to identify users' search intents and seek a method for intent clustering based on real-world query data. Design/methodology/approach: We conducted an experiment that analyzes selected search sessions from the America Online (AOL) query logs with a two-stage approach. The first stage identifies the underlying intent by combining query co-occurrence information with query expression similarity. The second stage clusters the identified results by constructing query vectors through random walks on a Markov graph. Findings: The average correctness of search intent identification is 0.74. Precision, recall, and F-score for intent clustering are 0.73, 0.72, and 0.71, respectively. The results indicate that combining session co-occurrence information with query expression similarity can further filter noise and that our clustering method is better suited to sparse data. Research limitations: We use a time-out threshold (15 minutes) to group queries into sessions, but a user may have multiple search goals at the same time, and such multi-task behavior is hard to capture in a session defined by time alone. Practical implications: This study provides insights into understanding users' search intents by analyzing their queries and refinements from a new perspective. The results will help search engine developers identify user intents. Originality/value: We propose a new method for identifying users' search intents by combining session co-occurrence information and query expression similarity, and a new method for clustering sparse data.
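The time-out rule mentioned under the research limitations can be sketched in a few lines. This is only an illustration of the general technique, assuming one user's queries arrive already sorted by time; the function name `split_sessions` and the sample log are hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(events, timeout=timedelta(minutes=15)):
    """Group one user's queries into sessions.

    events: list of (timestamp, query) pairs sorted by timestamp.
    A new session starts whenever the gap since the previous query
    is longer than `timeout`.
    """
    sessions = []
    previous_time = None
    for timestamp, query in events:
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(query)
        previous_time = timestamp
    return sessions

log = [
    (datetime(2006, 3, 1, 10, 0), "jaguar"),
    (datetime(2006, 3, 1, 10, 5), "jaguar car price"),
    (datetime(2006, 3, 1, 10, 40), "weather boston"),
]
print(split_sessions(log))
```

The sketch also shows the limitation the abstract points out: two unrelated queries issued within the time-out window would land in the same session, because only the time gap is considered.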
Funding: Funded by the Youth Chengguang Project of Science and Technology of Wuhan City of China (No. 20045006071-16).
Abstract: GML is becoming the de facto standard for electronic data exchange among Web applications and distributed geographic information systems. However, conventional query languages (e.g., SQL and its extended versions) are not suitable for directly querying and updating GML documents. Even approaches that work well with XML cannot guarantee good results when applied to GML documents. Although XQuery is a powerful standard query language for XML, it was not designed for querying spatial features, which are the most important components of GML documents. We propose GQL, a query language specification that extends XQuery to support spatial queries over GML documents. The data model, algebra, and formal semantics of GQL, as well as its various spatial functions and operations, are presented in detail.
Abstract: While search engines have become vital tools for finding information on the Internet, privacy remains a growing concern because search engines are technically able to retain user search logs. Although such capabilities might provide enhanced personalized search results, the confidentiality of user intent remains uncertain. Even with web search query obfuscation techniques, another challenge remains: reusing the same obfuscation methods is problematic, given that search engines have enormous computation and storage resources for query disambiguation. A number of web search query privacy procedures involve the cooperation of the search engine, a non-trusted entity in such cases, which makes query obfuscation even more challenging. In this study, we first review how search engines work with regard to web search queries and user intent. Second, we review the material in a manner accessible to those outside computer science, with the intent of introducing knowledge of web search engines so that non-computer scientists can approach web search query privacy innovatively. As a contribution, we identify and highlight areas open for further investigative and innovative research on end-user personalized web search privacy, that is, methods that can be executed on the user side without the involvement of third parties such as search engines. The goal is to motivate future web search obfuscation heuristics that give users control over their personal search privacy.
Abstract: Cleaning duplicate data is a major problem that persists despite the many works devoted to solving it, owing to the exponential growth in the amount of data processed and the need for scalable and fast algorithms. The problem depends on the type and quality of the data and varies with the volume of the data set being manipulated. In this paper, we introduce a novel framework based on an extended fuzzy C-means algorithm that uses a topic ontology. This work aims to improve the OLAP querying process over heterogeneous data warehouses that contain big data sets by improving the integration of query results, eliminating redundancies with the extended classification algorithm, and measuring the loss of information.
Abstract: Query translation mining is a key technique in cross-language information retrieval and in knowledge acquisition for machine translation. For better performance, queries are classified into transliterated and non-transliterated words by a transliterated word identification model and are then channeled to different mining processes. This paper is a pilot study on query classification for better translation mining performance, based on supervised classification and linguistic heuristics. Person name identification reaches a precision of over 97%. Transliterated word translation mining shows satisfactory performance.
Abstract: Data is generally abundant in unlabeled form, and annotating it incurs a cost. Both the labeling cost and the learning cost can be minimized by learning from as few labeled instances as possible. Active learning (AL) learns from a few labeled instances with the additional facility of querying an expert annotator, or oracle, for the labels of further instances. The active learner uses an instance selection strategy to choose the critical query instances that reduce the generalization error as quickly as possible. This process yields a refined training dataset, which helps minimize the overall cost. The key to the success of AL is the query strategy, which selects the candidate query instances and helps the learner learn a valid hypothesis. This survey reviews AL query strategies for classification, regression, and clustering under the pool-based AL scenario. The query strategies for classification are further divided into informative-based, representative-based, informative- and representative-based, and others. More advanced query strategies based on reinforcement learning and deep learning are also presented, along with query strategies for realistic environment settings. After a rigorous mathematical analysis of AL strategies, this work presents a comparative analysis of them. Finally, an implementation guide, applications, and challenges of AL are discussed.
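As an illustration of an informative-based query strategy in the pool-based scenario, the sketch below implements entropy-based uncertainty sampling, a standard strategy of this kind. It assumes some classifier has already produced class probabilities for every unlabeled instance; the function names and the example probabilities are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def select_query(pool_probabilities):
    """Return the index of the pool instance the model is least sure about.

    pool_probabilities: one list of class probabilities per unlabeled instance.
    """
    return max(
        range(len(pool_probabilities)),
        key=lambda i: entropy(pool_probabilities[i]),
    )

# Predicted class probabilities for three unlabeled instances.
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(select_query(pool))  # index of the instance to send to the oracle
```

In a full loop, the selected instance would be labeled by the oracle, moved from the pool to the training set, and the classifier retrained before the next selection. Representative-based strategies would additionally take into account how typical an instance is of the pool.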
Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 60973021, 61033007, and 61003060, and the Fundamental Research Funds for the Central Universities of China under Grant No. N100704001.
Abstract: Keyword query has attracted much research attention because of its simplicity and wide applicability. However, the inherent ambiguity of keyword queries often leads to unsatisfactory query results. Moreover, existing techniques for Web query and for keyword query in relational and XML databases cannot be fully applied to keyword query in dataspaces. We therefore propose KeymanticES, a novel keyword-based semantic entity search mechanism for dataspaces that combines the features of keyword query and semantic query. We focus on the query intent disambiguation problem and propose a novel three-step approach to resolve it. Extensive experimental results show the effectiveness and correctness of the proposed approach.
Abstract: Reliable authentication of individuals is a demanding service that is growing in many areas, not only in military and police services but also in community and civilian applications such as financial transactions. In this paper, we propose a human verification method that depends on extracting a set of retinal feature points. Each set of feature points represents landmarks in the retinal vessel tree. The extraction and matching of the pattern, based on Gabor filters and SVM, are described. The validity of the proposed method is verified with experimental results obtained on three commonly available databases, namely STARE, DRIVE, and VARIA. The proposed retinal verification method gives recognition rates of 92.6%, 100%, and 98.2% on these databases, respectively. Furthermore, for the authentication task, the proposed method achieves moderate accuracy on the retinal vessel images from these databases.
Funding: The authors thank the anonymous referees for their useful comments, which greatly improved the quality of the paper. This work was supported in part by the National Basic Research Program 973 of China (2012CB316203), the Natural Science Foundation of China (Grant Nos. 61033007, 61272121, 61332014, 61572367, 61332006, 61472321, and 61502390), the National High Technology Research and Development Program 863 of China (2015AA015307), the Fundamental Research Funds for the Central Universities (3102015JSJ0011, 3102014JSJ0005, and 3102014JSJ0013), and the Graduate Starting Seed Fund of Northwestern Polytechnical University (Z2012128).
Abstract: The order-preserving submatrix (OPSM) has become important in modelling biologically meaningful subspace clusters, as it captures the general tendency of gene expressions across a subset of conditions. With advances in microarray and analysis techniques, large volumes of gene expression datasets and OPSM mining results are being produced. OPSM query can efficiently retrieve relevant OPSMs from this huge amount of OPSM data. However, improving the relevancy of OPSM queries remains a difficult task in real-life exploratory data analysis. First, it is hard to capture subjective aspects of interestingness, e.g., the analyst's expectations given her or his domain knowledge. Second, even when these expectations can be specified declaratively, it is still challenging to use them during the computation of OPSM queries. To the best of our knowledge, existing methods mainly focus on batch OPSM mining, while few works address OPSM query. To solve these problems, the paper proposes two constrained OPSM query methods, which exploit user-defined constraints to search for relevant results in the two kinds of indices introduced. Extensive experiments are conducted on real datasets, and the results demonstrate that queries based on the multi-dimension index (cIndex) and the enumerating sequence index (esIndex) perform better than brute-force search.
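For readers unfamiliar with the OPSM pattern, the following minimal sketch checks the defining property: under a given ordering of a subset of columns (conditions), the values in every selected row (gene) are strictly increasing. It also shows a naive scan that finds the rows supporting a column ordering, in the spirit of a brute-force baseline. It does not implement cIndex or esIndex, and the function names and the toy matrix are hypothetical.

```python
def is_opsm(matrix, rows, ordered_columns):
    """True if every selected row is strictly increasing along ordered_columns."""
    for r in rows:
        values = [matrix[r][c] for c in ordered_columns]
        if any(a >= b for a, b in zip(values, values[1:])):
            return False
    return True

def supporting_rows(matrix, ordered_columns):
    """Naive scan: all rows whose values follow the given column ordering."""
    return [
        r for r in range(len(matrix))
        if is_opsm(matrix, [r], ordered_columns)
    ]

# Toy expression matrix: rows are genes, columns are conditions.
expression = [
    [1, 5, 3],
    [2, 9, 4],
    [7, 1, 8],
]
# Query: which genes satisfy condition 0 < condition 2 < condition 1?
print(supporting_rows(expression, [0, 2, 1]))
```

The scan touches every row for every query, which is the cost an index-based method aims to avoid.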