Abstract: A large amount of data is available on the web and can be used for purposes such as product recommendation, price comparison, and demand forecasting for a particular product. Websites are designed for human understanding, not for machines. Making their data machine-readable therefore requires techniques for extracting data from web pages. Researchers have addressed the problem using two approaches: knowledge engineering and machine learning. State-of-the-art knowledge engineering approaches use the structure of documents, visual cues, clustering of the attributes of data records, and text processing techniques to identify data records on a web page. Machine learning approaches use annotated pages to learn rules, which are then used to extract data from unseen web pages. The structure of web documents is continuously evolving, so new techniques are needed to handle the emerging requirements of web data extraction. In this paper, we present a novel, simple, and efficient technique for extracting data from web pages using visual styles and the structure of documents. The proposed technique detects the Rich Data Region (RDR) using the query and correlative words of the query. The RDR is then divided into data records using style similarity. Noisy elements are removed using a Common Tag Sequence (CTS) and formatting entropy. The system is implemented in Java and run on a dataset of real-world working websites. The effectiveness of the results is evaluated using precision, recall, and F-measure and compared with five existing systems. The comparison of the proposed technique with existing systems shows encouraging results.
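The evaluation measures named in the abstract can be illustrated with a minimal Java sketch. It assumes that extracted and ground-truth data records are compared as sets of identifiers and that "F-measure" means the balanced F1 score; the abstract states neither, and the class and method names are hypothetical rather than taken from the paper.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not from the paper: scores one page's extraction result.
final class ExtractionMetrics {
    private ExtractionMetrics() {}

    // Number of extracted records that also appear in the ground truth.
    static <T> int truePositives(Set<T> extracted, Set<T> relevant) {
        Set<T> hits = new HashSet<>(extracted);
        hits.retainAll(relevant);
        return hits.size();
    }

    // Fraction of extracted records that are correct (0 if nothing was extracted).
    static <T> double precision(Set<T> extracted, Set<T> relevant) {
        if (extracted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(extracted, relevant) / extracted.size();
    }

    // Fraction of ground-truth records that were extracted (0 if there are none).
    static <T> double recall(Set<T> extracted, Set<T> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(extracted, relevant) / relevant.size();
    }

    // Balanced F-measure (F1, an assumption): harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Made-up record identifiers, for illustration only.
        Set<String> relevant = Set.of("r1", "r2", "r3", "r4");
        Set<String> extracted = Set.of("r1", "r2", "r5");
        double p = precision(extracted, relevant);
        double r = recall(extracted, relevant);
        System.out.println("precision=" + p + " recall=" + r + " F=" + fMeasure(p, r));
    }
}
```

In this made-up example, two of the three extracted records are correct and two of the four true records are found, so precision is 2/3, recall is 1/2, and the balanced F-measure is 4/7.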
Funding: Supported by the National Key Basic Research Program of China (No. 2013CB834204), the National Natural Science Foundation of China (Nos. 61672300 and 61772291), the Natural Science Foundation of Tianjin, China (Nos. 16JCYBJC15500 and 17JCZDJC30500), and the Open Project Foundation of the Information Security Evaluation Center of Civil Aviation, Civil Aviation University of China (No. CAACISECCA-201702).
Abstract: The leakage of sensitive data occurs on a large scale and with increasingly serious impact. It may cause privacy disclosure or even property damage. Password leakage is one of the fundamental causes of information leakage, and its importance must be emphasized because users are likely to use the same passwords for different Web application accounts. Existing approaches use password managers and encrypted Web applications to protect passwords and other sensitive data; however, they may be compromised or lack accessibility. The paper presents SecureWeb, a secure, practical, and user-controllable framework for mitigating the leakage of sensitive data. SecureWeb protects users' passwords and aims to provide a unified protection solution for diverse sensitive data. The efficiency of the developed schemes is demonstrated, and the results indicate that they have low overhead and are of practical use.