Funding: Supported by the Quality Engineering Project of Guangdong University of Science and Technology under Grant GKZLGC2024160.
Abstract: The data collection and web crawling course combines a large body of theoretical knowledge with a strong practical component, and traditional teaching methods no longer meet its teaching needs. Based on the characteristics of the course, this article constructs a blended teaching environment based on "Learning Pass + Hongya Platform + Offline Course," integrates teaching resource libraries and ideological and political cases, and develops a suitable evaluation system to cultivate students' innovative and critical thinking abilities, stimulate their learning initiative, improve their teamwork ability, and enhance their professional level and data literacy.
Funding: Funded by the National Key R&D Program of China (Grant No. 2022YFB3102901), the National Natural Science Foundation of China (No. 62072115, No. 62102094), and the Shanghai Science and Technology Innovation Action Plan Project (No. 22510713600).
Abstract: Topological information is very important for understanding different types of online web services, in particular online social networks (OSNs). People leverage such information for various applications, such as social relationship modeling, community detection, user profiling, and user behavior prediction. However, because this information is so useful for characterizing users, its leakage also poses severe challenges to user privacy. Large-scale information probing based on web crawling is a representative way of obtaining the topological information of online web services. In this paper, we explore how to defend against topological information probing for online web services, with a particular focus on decentralized web services such as Mastodon. Unlike in traditional centralized web services, the federated nature of decentralized web services makes distributed crawlers even more difficult to identify. We analyze the behavioral differences between legitimate users and crawlers in decentralized web services and highlight two key behavioral attributes that distinguish crawlers from legitimate users: instance interaction preferences and hop count in profile viewing patterns. Based on these insights, we propose a supervised machine learning-based framework for crawler detection, which learns federation-aware feature representations for users. To validate the framework's effectiveness, we construct a labeled dataset that integrates real users with real-trace-driven simulated crawlers in Mastodon. We use this dataset to train various supervised classifiers for crawler detection. Experimental results demonstrate that our framework achieves excellent classification performance. Moreover, federation-aware features are observed to be effective in improving detection performance.
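The abstract names two behavioral attributes but does not define how they are computed. The following is a minimal Python sketch of one possible reading, not the authors' method: it assumes profile views are logged as `user@instance` handles, measures instance interaction preference as the Shannon entropy of the instances viewed, and measures hop count as the shortest-path distance in a follow graph. The function names, the log format, and the example handles are all hypothetical.

```python
from collections import Counter, deque
from math import log2


def instance_entropy(viewed_handles):
    """Shannon entropy (bits) of the instances hosting the viewed profiles.

    Assumption: each handle looks like "user@instance".
    """
    instances = [h.rsplit("@", 1)[-1] for h in viewed_handles]
    total = len(instances)
    if total == 0:
        return 0.0
    counts = Counter(instances)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def hop_distances(follow_graph, start):
    """Breadth-first search: hops from `start` to every reachable handle."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in follow_graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def mean_hop_count(viewed_handles, follow_graph, start):
    """Mean hop distance from `start` to the viewed profiles.

    Profiles unreachable from `start` are ignored (a simplifying assumption).
    """
    dist = hop_distances(follow_graph, start)
    hops = [dist[h] for h in viewed_handles if h in dist]
    return sum(hops) / len(hops) if hops else 0.0


# Hypothetical follow graph and view log for one account.
graph = {
    "alice@a.example": ["bob@a.example", "carol@b.example"],
    "carol@b.example": ["dave@c.example"],
}
views = ["bob@a.example", "carol@b.example", "dave@c.example"]

features = [
    instance_entropy(views),
    mean_hop_count(views, graph, "alice@a.example"),
]
print(features)
```

A feature vector of this kind, computed per account, could then be passed to any supervised classifier. Which features and classifiers the paper actually uses is not stated in the abstract.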