Funding: Supported by the National Natural Science Foundation of China (Grant No. U24B20146), the National Key Research and Development Plan of China (Grant No. 2020YFB1005500), and the Beijing Natural Science Foundation Project (No. M21034).
Abstract: With the rapid development of Web 3.0 applications, the volume of shared data is increasing, and the inefficiency of big data file sharing and the problem of data privacy leakage are becoming more and more prominent; existing data sharing schemes can no longer meet the growing demand. This paper aims to explore a secure, efficient, and privacy-preserving data sharing scheme for Web 3.0 applications. Specifically, it adopts InterPlanetary File System (IPFS) technology to store large data files, overcoming the storage capacity limitation of blockchains, and uses ciphertext-policy attribute-based encryption (CP-ABE) and proxy re-encryption (PRE) to achieve secure multi-party sharing and fine-grained access control. The paper provides the detailed algorithm design and implementation of the data sharing phases and processes, and analyzes the algorithms from the perspectives of security, privacy protection, and performance.
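As a toy illustration of the storage pattern this abstract describes, the sketch below combines content-addressed storage (the ciphertext's SHA-256 digest stands in for an IPFS CID) with envelope encryption. The XOR keystream cipher, the in-memory `store` dict, and all function names are illustrative assumptions, not the paper's scheme; the CP-ABE/PRE protection of the data key is deliberately left out.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher keyed by SHA-256 counter blocks. Illustration
    only, NOT secure; a real system would use AES-GCM or similar."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

def share_file(plaintext: bytes, store: dict) -> tuple[str, bytes]:
    """Encrypt a file, put the ciphertext in content-addressed storage
    (the CID is the SHA-256 of the ciphertext, IPFS-style), and return
    (cid, data_key). In the paper's scheme the data key would be protected
    with CP-ABE and re-encrypted by a proxy for each recipient."""
    data_key = secrets.token_bytes(32)
    ciphertext = keystream_xor(data_key, plaintext)
    cid = hashlib.sha256(ciphertext).hexdigest()
    store[cid] = ciphertext  # stands in for an IPFS add operation
    return cid, data_key

def fetch_file(cid: str, data_key: bytes, store: dict) -> bytes:
    """Retrieve by CID and decrypt; the XOR keystream is symmetric."""
    return keystream_xor(data_key, store[cid])

store = {}
cid, key = share_file(b"marine survey records", store)
assert fetch_file(cid, key, store) == b"marine survey records"
```

Only the CID and the wrapped key need to go on-chain in such a design; the bulk ciphertext lives off-chain.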
Abstract: The World Wide Web provides a wealth of information about everything, including contemporary audio and visual art events, which are discussed on media outlets, blogs, and specialized websites alike. This information can become a robust source of real-world data and form the basis of an objective, data-driven analysis. In this study, a methodology for collecting information about audio and visual art events from a large array of websites in an automated manner is presented in detail. The process uses cutting-edge Semantic Web, Web Search, and Generative AI technologies to convert website documents into a collection of structured data. The value of the methodology is demonstrated by creating a large dataset concerning audiovisual events in Greece. The collected information includes event characteristics, metrics estimated from their text descriptions, outreach metrics based on the media that reported them, and a multi-layered classification of the events by type, subject, and method. The dataset is openly provided to the general and academic public through a Web application. Moreover, each event's outreach is evaluated using these quantitative metrics, the results are analyzed with an emphasis on classification popularity, and useful conclusions are drawn concerning the importance of artistic subjects, methods, and media.
Funding: Microsoft Research Asia Internet Services in Academic Research Fund (No. FY07-RES-OPP-116) and the Science and Technology Development Program of Tianjin (No. 06YFGZGX05900).
Abstract: To improve question answering (QA) performance on real-world web data sets, a new set of question classes and a general answer re-ranking model are defined. Using a pre-defined dictionary and grammatical analysis, the question classifier feeds both semantic and grammatical information into information retrieval and machine learning methods in the form of various training features, including the question word, the main verb of the question, the dependency structure, the position of the main auxiliary verb, the main noun of the question, and the top hypernym of the main noun. The QA query results are then re-ranked using question class information. Experiments show that questions in real-world web data sets can be accurately classified by the classifier, and that the re-ranked QA results are noticeably improved. This demonstrates that, with both semantic and grammatical information, applications such as QA built upon real-world web data sets can achieve better performance.
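A minimal sketch of the kind of feature extraction such a classifier relies on, with hypothetical feature names; the grammatical features (main verb, main noun, dependency structure, hypernyms) are stubbed out here because they require a parser and a lexical resource.

```python
# Hypothetical feature set for question classification; the paper's actual
# features also draw on a dependency parse and hypernym lookups.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def extract_features(question: str) -> dict:
    """Turn a question string into a flat feature dictionary."""
    tokens = question.lower().rstrip("?").split()
    qword = next((t for t in tokens if t in QUESTION_WORDS), None)
    return {
        "question_word": qword,
        "first_token": tokens[0],
        "length": len(tokens),
        # stand-ins for the grammatical features described in the paper:
        "main_verb": None,   # would come from a dependency parser
        "main_noun": None,   # would come from a dependency parser
    }

feats = extract_features("When was the World Wide Web invented?")
assert feats["question_word"] == "when"
```

Each such dictionary would then be vectorized and fed to the learning method of choice.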
Funding: Program for New Century Excellent Talents in University (No. NCET-06-0290), the National Natural Science Foundation of China (No. 60503036), and the Fok Ying Tong Education Foundation Award (No. 104027).
Abstract: To improve the quality of web search, a new query expansion method that selects meaningful structured data from a domain database is proposed. It categorizes attributes into three classes, named concept attributes, context attributes, and meaningless attributes, according to their semantic features, namely document frequency features and distinguishing capability features. It also defines the semantic relevance between two attributes when they are correlated in the database. It then proposes a trie-bitmap structure and pair pointer tables to implement efficient algorithms for discovering attribute semantic features and detecting their semantic relevances. Using semantic attributes and their semantic relevances, expansion words are generated and embedded into a vector space model with interpolation parameters. The experiments use the IMDB movie database and real text collections to evaluate the proposed method against a classical vector space model. The results show that the proposed method improves text search efficiently, and that the discovered semantic features and semantic relevances have good separation capabilities.
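The interpolation step can be sketched as a weighted blend of the original query vector with expansion-term weights; the vectors, the example weights, and the parameter name `lam` are illustrative assumptions, not the paper's exact formulation.

```python
def expand_query(query_vec: dict, expansion_vec: dict, lam: float = 0.7) -> dict:
    """Blend the original query weights with expansion-term weights:
    w'(t) = lam * w_query(t) + (1 - lam) * w_expansion(t)."""
    terms = set(query_vec) | set(expansion_vec)
    return {t: lam * query_vec.get(t, 0.0) + (1 - lam) * expansion_vec.get(t, 0.0)
            for t in terms}

q = {"matrix": 1.0}
# hypothetical expansion terms drawn from concept/context attributes
# of a movie database
e = {"keanu": 0.5, "reeves": 0.5, "matrix": 1.0}
merged = expand_query(q, e, lam=0.7)
assert abs(merged["matrix"] - 1.0) < 1e-9   # 0.7*1.0 + 0.3*1.0
assert abs(merged["keanu"] - 0.15) < 1e-9   # 0.3*0.5
```

The interpolation parameter controls how strongly database-derived terms influence retrieval relative to the user's original terms.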
Funding: Supported by the Natural Science Foundation of China (60573091, 60273018), the National Basic Research and Development Program of China (2003CB317000), and the Key Project of the Ministry of Education of China (03044).
Abstract: With the rapid development of the Web, more and more Web databases are available for users to access. At the same time, job seekers often have difficulty first finding the right sources and then querying over them, so an integrated job search system over Web databases has become a Web application in high demand. Based on this consideration, we build a deep Web data integration system that, as a job meta-search engine, supports unified access to multiple job Web sites. In this paper, the architecture of the system is given first, and then the key components of the system are introduced.
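A minimal sketch of the meta-search idea, assuming a hypothetical adapter interface in which each wrapped job site is a callable returning result dictionaries; real wrappers would fill in each site's query form and parse its result pages.

```python
def meta_search(query: str, adapters: list) -> list:
    """Fan the query out to each wrapped job site and merge the results,
    deduplicating by (title, company) and keeping the first occurrence."""
    merged = []
    for adapter in adapters:
        merged.extend(adapter(query))
    seen, unique = set(), []
    for job in merged:
        key = (job["title"], job["company"])
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique

# stand-in adapters for two job sites (illustrative data)
site_a = lambda q: [{"title": "Data Engineer", "company": "Acme"}]
site_b = lambda q: [{"title": "Data Engineer", "company": "Acme"},
                    {"title": "Web Developer", "company": "Initech"}]
jobs = meta_search("engineer", [site_a, site_b])
assert len(jobs) == 2
```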
Funding: Supported by the Knowledge Innovation Program of the Chinese Academy of Sciences (No. KZCX1-YW-12-04) and the National High Technology Research and Development Program of China (863 Program) (Nos. 2009AA12Z148, 2007AA092202). Support for this study was provided by the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences (IGSNRR, CAS) and the Institute of Oceanology, CAS.
Abstract: With long-term marine surveys and research, and especially with the development of new marine environment monitoring technologies, prodigious amounts of complex marine environmental data are generated and continue to increase rapidly. These data are characterized by massive volume, wide distribution, multiple sources, heterogeneity, and multi-dimensional, dynamic structure in space and time. The present study recommends an integrative visualization solution for these data, to enhance the visual display of data and data archives and to promote the joint use of data distributed among different organizations or communities. The study also analyzes web services technologies and defines the concept of the marine information grid, then focuses on spatiotemporal visualization and proposes a process-oriented spatiotemporal visualization method. We discuss how marine environmental data can be organized based on this method, and how the organized data are represented for use with web services and stored in a reusable fashion. In addition, we provide an original, integrative visualization architecture based on the explored technologies. Finally, we present a prototype system for marine environmental data of the South China Sea, with visualizations of Argo floats, sea surface temperature fields, sea current fields, salinity, in-situ investigation data, and ocean stations. The integrative visualization architecture is illustrated on the prototype system, which highlights the process-oriented spatiotemporal visualization method and demonstrates the benefit of the architecture and methods described in this study.
Abstract: Since web-based GIS processes large amounts of spatial geographic information over the Internet, the efficiency of spatial data query processing and transmission should be improved. This paper presents two efficient methods for this purpose: division transmission and progressive transmission. In the division transmission method, a map is divided into several parts, called "tiles", and only the tiles requested by a client are transmitted. In the progressive transmission method, a map is split into several phase views based on the significance of its vertices, and the server produces a target object and transmits it progressively when a client requests that spatial object. To realize these methods, this paper proposes the "tile division" and "priority order estimation" algorithms together with the corresponding data transmission strategies. Compared with traditional methods such as whole-map transmission and layer transmission, the web-based GIS data transmission proposed in this paper increases data transmission efficiency by a great margin.
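The two ideas can be sketched roughly as follows; the uniform grid split and the triangle-area significance score are plausible simplifications, not the paper's exact "tile division" and "priority order estimation" algorithms.

```python
def divide_into_tiles(bbox, rows, cols):
    """Split a map bounding box (minx, miny, maxx, maxy) into a rows x cols
    grid of tiles, so a server can transmit only the tiles a client requests."""
    minx, miny, maxx, maxy = bbox
    w = (maxx - minx) / cols
    h = (maxy - miny) / rows
    return [(minx + c * w, miny + r * h, minx + (c + 1) * w, miny + (r + 1) * h)
            for r in range(rows) for c in range(cols)]

def vertex_priority(prev, v, nxt):
    """Toy significance score for progressive transmission: the area of the
    triangle a vertex forms with its neighbours; vertices with larger scores
    would be sent in earlier phases."""
    (x1, y1), (x2, y2), (x3, y3) = prev, v, nxt
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2.0

tiles = divide_into_tiles((0.0, 0.0, 100.0, 100.0), rows=2, cols=2)
assert len(tiles) == 4
assert vertex_priority((0, 0), (1, 5), (2, 0)) == 5.0
```

A nearly collinear vertex scores close to zero and can safely be deferred to a later phase view.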
Abstract: Influenza is an infectious disease that spreads quickly and widely, and its outbreaks have brought huge losses to society. In this paper, four major categories of flu keywords were defined: "prevention phase", "symptom phase", "treatment phase", and "commonly-used phrase". A Python web crawler was used to obtain relevant influenza data from the National Influenza Center's weekly influenza surveillance reports and from the Baidu Index. Prediction models based on support vector regression (SVR), least absolute shrinkage and selection operator (LASSO), and convolutional neural networks (CNN) were established through machine learning, taking into account the seasonal characteristics of influenza, and a time series model (ARMA) was also established. The results show that it is feasible to predict influenza from web search data, and that machine learning achieves a certain forecasting effect on this task, giving it reference value for future influenza prediction. The ARMA(3,0) model produces better predictions and generalizes well. Finally, the limitations of this work and future research directions are given.
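Of the models listed, the ARMA(3,0) case reduces to an AR(3) regression that can be fit by ordinary least squares. The sketch below does this with plain-Python normal equations on a synthetic series, since the paper's real inputs (surveillance reports and Baidu Index counts) are not available here; the coefficient values and series are illustrative only.

```python
def fit_ar(series, p=3):
    """Least-squares fit of an AR(p) model (an ARMA(p, 0), as in the paper):
    y_t ~ c + a1*y_{t-1} + ... + ap*y_{t-p}, via the normal equations."""
    rows = [[1.0] + [series[t - i] for i in range(1, p + 1)]
            for t in range(p, len(series))]
    y = series[p:]
    n = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(n)]
    # solve xtx * coef = xty by Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, n):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, n):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        coef[r] = (xty[r] - sum(xtx[r][c] * coef[c]
                                for c in range(r + 1, n))) / xtx[r][r]
    return coef  # [c, a1, ..., ap]

def predict_next(series, coef):
    """One-step-ahead forecast from the fitted coefficients."""
    p = len(coef) - 1
    return coef[0] + sum(coef[i] * series[-i] for i in range(1, p + 1))

# synthetic weekly series that exactly follows an AR(3) recurrence
history = [1.0, 2.0, 3.0]
for _ in range(15):
    history.append(0.6 * history[-1] + 0.3 * history[-2] + 0.2 * history[-3])
coef = fit_ar(history, p=3)
forecast = predict_next(history, coef)
```

In practice one would use a statistics library with proper order selection and diagnostics; this sketch only shows the mechanics of the AR fit.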
Funding: Supported by the National Defense Pre-Research Foundation of China (4110105018).
Abstract: How to integrate heterogeneous semi-structured Web records into a relational database is an important and challenging research topic. An improved conditional random field model is presented that combines learning from labeled samples and unlabeled database records, in order to reduce the dependence on tediously hand-labeled training data. The proposed model is used to solve the problem of schema matching between the data source schema and the database schema. Experimental results on a large number of Web pages from diverse domains show the effectiveness of the novel approach.
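The schema-matching step can be illustrated with a deliberately simplified stand-in for the CRF model: a greedy match by token-overlap (Jaccard) similarity between source fields and database columns. All names and data below are made up for illustration.

```python
def match_schema(source_fields: dict, db_schema: dict) -> dict:
    """Greedily map each source field (name plus a sample value) to the
    unclaimed database column with the highest token-overlap similarity.
    A simplified stand-in for the paper's conditional random field model."""
    def tokens(s):
        return set(s.lower().replace("_", " ").split())

    mapping, taken = {}, set()
    for src, sample in source_fields.items():
        best, best_score = None, 0.0
        for col in db_schema:
            if col in taken:
                continue
            a, b = tokens(src) | tokens(sample), tokens(col)
            score = len(a & b) / len(a | b) if a | b else 0.0
            if score > best_score:
                best, best_score = col, score
        if best is not None:
            mapping[src] = best
            taken.add(best)
    return mapping

records = {"Job Title": "software engineer", "Company Name": "Acme"}
schema = {"title": "text", "company": "text"}
assert match_schema(records, schema) == {"Job Title": "title",
                                         "Company Name": "company"}
```

The CRF in the paper improves on this by jointly scoring label sequences and exploiting unlabeled database records.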
Funding: The National Natural Science Foundation of China (60425206, 60503033), the National Basic Research Program of China (973 Program, 2002CB312000), and the Opening Foundation of the State Key Laboratory of Software Engineering at Wuhan University.
Abstract: This paper proposes a method of data-flow testing for Web services composition. First, to facilitate data-flow analysis and constraint collection, the existing model representation of the business process execution language (BPEL) is modified in line with an analysis of data dependency, and an exact representation of dead path elimination (DPE) is proposed, which overcomes the difficulties DPE poses to data-flow analysis. Def-use information based on data-flow rules is then collected by parsing BPEL and Web services description language (WSDL) documents, and a def-use annotated control flow graph is created. Based on this model, data-flow anomalies that indicate potential errors can be discovered by traversing the paths of the graph, and the all-du-paths used in dynamic data-flow testing for Web services composition are generated automatically, so that testers can design test cases according to the constraints collected for each selected path.
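The all-du-paths step can be sketched on a toy control-flow graph; the node numbering, the def/use dictionaries, and the loop-free restriction are simplifying assumptions (a use that precedes a redefinition within the same node is also ignored in this sketch).

```python
def all_du_paths(cfg, defs, uses):
    """Enumerate def-clear du-paths on a control-flow graph by DFS: each
    path runs from a node defining variable v to a node using v, with no
    intervening redefinition of v; loops are not re-entered."""
    paths = []

    def dfs(node, var, path):
        path = path + [node]
        if var in defs.get(node, set()):
            return  # redefinition of var kills the path here
        if var in uses.get(node, set()):
            paths.append(path)  # complete du-path; keep exploring for later uses
        for succ in cfg.get(node, []):
            if succ not in path:  # keep paths simple
                dfs(succ, var, path)

    for node, dvars in defs.items():
        for var in dvars:
            for succ in cfg.get(node, []):
                dfs(succ, var, [node])
    return paths

# CFG for:  1: x=read()   2: if c   3: y=x   4: x=0   5: print(x)
cfg  = {1: [2], 2: [3, 4], 3: [5], 4: [5], 5: []}
defs = {1: {"x"}, 4: {"x"}}
uses = {3: {"x"}, 5: {"x"}}
paths = all_du_paths(cfg, defs, uses)
```

Each emitted path is exactly the kind of target a tester would cover with one test case under the all-du-paths criterion.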
Abstract: Web information systems (WIS) are frequently used and indispensable in daily social life, providing information services in many scenarios such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios to improve the quality of data services. In this paper, we present a review of state-of-the-art methods for data cleaning in WIS. Based on the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technology, to classify the existing works. Then, after elaborating on and analyzing each category, we summarize the descriptions and challenges of data cleaning methods along sub-elements such as data and user interaction, data quality rules, models, crowdsourcing, and privacy preservation. Finally, we analyze the various types of problems and provide suggestions for future research on data cleaning in WIS from the technological and interactive perspectives.
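One of the data-quality-rule techniques such surveys cover can be shown concretely: detecting violations of a functional dependency. This is a minimal sketch with made-up example rows, not a method from the paper itself.

```python
def fd_violations(rows, lhs, rhs):
    """Find violations of a functional dependency lhs -> rhs (a common data
    quality rule): groups of rows that agree on lhs but disagree on rhs."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups.setdefault(key, set()).add(tuple(row[a] for a in rhs))
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

rows = [
    {"zip": "100080", "city": "Beijing"},
    {"zip": "100080", "city": "Bejing"},   # dirty value violating zip -> city
    {"zip": "200030", "city": "Shanghai"},
]
bad = fd_violations(rows, lhs=["zip"], rhs=["city"])
assert set(bad) == {("100080",)}
```

Repairing such violations (choosing which value to keep) is where the models, crowdsourcing, and user-interaction techniques from the survey come in.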
Funding: Supported by the National 863 High-Tech Program of China (No. 2007AA12Z237) and the Natural Science Foundation of China (No. 40571123).
Abstract: To meet the requirements of efficient management and web publishing of marine remote sensing data, a spatial database engine named MRSSDE was designed independently. Its logical model, physical model, and optimization method are discussed in detail. Compared with ArcSDE, the leading spatial database engine product, MRSSDE proved to be more effective.
Funding: Contents discussed in this paper are part of a key project, No. 2000-A31-01-04, sponsored by the Ministry of Science and Technology of P.R. China.
Abstract: Web data extraction obtains valuable data from the tremendous information resource of the World Wide Web according to a pre-defined pattern, processing and classifying the data on the Web. A formalization of the Web data extraction procedure is presented, together with a description of the crawling and extraction algorithms. Based on this formalization, an XML-based page structure description language, TIDL, is introduced, covering the object model, the HTML object reference model, and the definition of tags. Finally, a Web data gathering and querying application based on Internet agent technology, named the Web Integration Services Kit (WISK), is described.
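A minimal sketch of pattern-driven extraction using only the standard-library HTML parser; the "collect text of elements with a given class" pattern is far simpler than the TIDL language the paper defines, and assumes well-formed, non-void tags.

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class matches a pre-defined
    pattern, as a tiny illustration of pattern-driven Web data extraction."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0       # >0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif dict(attrs).get("class") == self.wanted_class:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

p = ClassTextExtractor("price")
p.feed('<ul><li class="price">$9.99</li><li class="name">Widget</li>'
       '<li class="price">$19.50</li></ul>')
assert p.results == ["$9.99", "$19.50"]
```

A real extractor would add crawling, tolerate malformed markup, and map the captured strings into typed records.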
Funding: Supported by the SEC E-Institute: Shanghai High Institutions Grid Project.
Abstract: This paper proposes a novel multilevel data cache model using Web caches (MDWC), based on network cost in data grids. By constructing a communication tree of grid sites based on network cost and using a single leader for each data segment within each region, the MDWC makes the most of the Web caches of other sites whose bandwidth reaches the job-executing site. The experimental results indicate that the MDWC reduces data response time and data update cost by avoiding network congestion while tuning its parameters to the application environment.
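The leader-per-region idea can be sketched as picking, in each region, the site with the lowest total network cost to its peers; the cost tables, region layout, and selection rule below are illustrative assumptions rather than the MDWC's exact construction.

```python
def pick_leaders(regions, cost):
    """For each region, pick as leader the site with the lowest total
    network cost to the other sites in its region."""
    leaders = {}
    for region, sites in regions.items():
        leaders[region] = min(
            sites,
            key=lambda s: sum(cost[s][t] for t in sites if t != s),
        )
    return leaders

# hypothetical symmetric link costs between three sites in one region
cost = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2},
    "C": {"A": 4, "B": 2},
}
regions = {"east": ["A", "B", "C"]}
assert pick_leaders(regions, cost) == {"east": "B"}  # totals: A=5, B=3, C=6
```

The chosen leader would then serve its region's copy of each data segment, keeping transfers off congested long-haul links.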
Funding: Supported by the National Natural Science Foundation of China (60573091, 60273018).
Abstract: A vision-based query interface annotation method is used to relate attributes and form elements in form-based web query interfaces; this method alone reaches an accuracy of 82%. A user participation method is then used to tune the result: users can answer "yes" or "no" to existing annotations, or manually annotate form elements. This mass feedback is fed back into the annotation algorithm to produce more accurate results. With this approach, query interface annotation can reach perfect accuracy.
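The mass-feedback step can be sketched as a simple confidence update per annotation; the additive rule, the weight, and the clamping to [0, 1] are assumptions for illustration, not the paper's algorithm.

```python
def update_confidence(conf, votes, weight=0.1):
    """Fold user feedback into an annotation's confidence score: each 'yes'
    nudges it up, each 'no' nudges it down, clamped to [0, 1]."""
    for answer in votes:
        delta = weight if answer == "yes" else -weight
        conf = min(1.0, max(0.0, conf + delta))
    return conf

# an annotation starting at the vision-based method's 0.82 accuracy level
c = update_confidence(0.82, ["yes", "yes", "no", "yes"])
assert abs(c - 1.0) < 1e-9
```

Annotations whose confidence collapses toward zero would be re-queued for manual annotation.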