Topological information is essential for understanding many types of online web services, in particular online social networks (OSNs). It is leveraged for applications such as social relationship modeling, community detection, user profiling, and user behavior prediction. However, because this information is so useful in characterizing users, its leakage also poses severe challenges to user privacy. Large-scale, crawling-based information probing is a representative way of obtaining the topological information of online web services. In this paper, we explore how to defend against topological information probing for online web services, with a particular focus on decentralized web services such as Mastodon. Unlike in traditional centralized web services, the federated nature of decentralized web services makes distributed crawlers even more difficult to identify. We analyze the behavioral differences between legitimate users and crawlers in decentralized web services and highlight two key behavioral attributes that distinguish crawlers from legitimate users: instance interaction preferences and hop count in profile viewing patterns. Based on these insights, we propose a supervised machine learning-based framework for crawler detection that learns federation-aware feature representations for users. To validate the framework's effectiveness, we construct a labeled dataset that combines real users with simulated crawlers driven by real traces in Mastodon, and we use this dataset to train various supervised classifiers for crawler detection. Experimental results demonstrate that our framework achieves excellent classification performance. Moreover, we observe that federation-aware features are effective in improving detection performance.
Funding: This work was funded by the National Key R&D Program of China (Grant No. 2022YFB3102901), the National Natural Science Foundation of China (Nos. 62072115 and 62102094), and the Shanghai Science and Technology Innovation Action Plan Project (No. 22510713600).
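The abstract names two behavioral attributes, instance interaction preferences and hop count in profile viewing patterns, but does not define how they are computed. The following is only a minimal sketch of how such federation-aware features might be derived, not the authors' implementation. It assumes that each account is identified by a `user@instance` handle, that a per-user list of viewed profiles and a follow graph are available, and that entropy over instances, the share of remote views, and the mean follow-graph distance are reasonable stand-ins for the two attributes. All function names and the `max_hops` cutoff are hypothetical.

```python
from collections import Counter, deque
from math import log2


def instance_of(handle):
    """Return the instance part of a 'user@instance' handle (assumed format)."""
    return handle.rsplit("@", 1)[1]


def instance_entropy(viewed):
    """Shannon entropy (bits) of the instances hosting the viewed profiles."""
    counts = Counter(instance_of(h) for h in viewed)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def hop_count(follows, src, dst, max_hops=6):
    """Shortest follow-graph distance from src to dst, found by BFS.

    Returns max_hops + 1 when dst is not reachable within max_hops.
    """
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue
        for nxt in follows.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return max_hops + 1


def federation_features(user, viewed, follows):
    """Build an illustrative feature dict for one user (hypothetical layout)."""
    if not viewed:
        return {"instance_entropy": 0.0, "remote_ratio": 0.0, "mean_hops": 0.0}
    home = instance_of(user)
    hops = [hop_count(follows, user, v) for v in viewed]
    remote = sum(instance_of(v) != home for v in viewed)
    return {
        "instance_entropy": instance_entropy(viewed),
        "remote_ratio": remote / len(viewed),
        "mean_hops": sum(hops) / len(hops),
    }


# Toy data, invented purely for illustration.
follows = {
    "a@x.social": ["b@x.social"],
    "b@x.social": ["c@y.social"],
}
features = federation_features(
    "a@x.social", ["b@x.social", "c@y.social"], follows
)
```

Feature dictionaries of this kind could then be turned into vectors and passed to any supervised classifier. The intuition under these assumptions is that a crawler enumerating the network would tend to view profiles spread across many instances and at larger follow-graph distances than a typical user would.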