Abstract
In the information age, CNKI has become the largest database of academic papers in China, and efficiently retrieving paper information and mining its value has become a pressing problem. Most existing paper retrieval tools are general-purpose crawlers, which collect only a small portion of the available information and return results that do not match user requirements. A system that combines focused collection of paper information with real-time analysis of paper data is therefore needed. To address these problems, the system uses the Python Django framework together with Celery to integrate the website with the crawler and automate crawling. It consists of a paper crawling module and a multidimensional analysis module. The paper crawling module uses Selenium to simulate user clicks, uses Requests and BeautifulSoup4 to retrieve and parse page content, and stores the extracted paper information in a MySQL database. The multidimensional analysis module uses Highcharts for data visualization and mainly analyzes keyword-related publication trends, prolific authors, institutions, and similar information. With this system, researchers can quickly and conveniently obtain paper information for their research field, which provides data support for further study.
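The multidimensional analysis step described in the abstract can be illustrated with a minimal sketch. It assumes the crawler has already stored paper records; the record fields (`year`, `authors`), the sample data, and the function names below are hypothetical and are not taken from the paper. The output dictionary follows the `xAxis.categories` and `series` layout that Highcharts options use, but how the authors' system actually feeds its charts is not stated in the source.

```python
from collections import Counter

# Hypothetical records, shaped like rows a crawler might have stored.
# Field names and values are assumptions made for illustration only.
papers = [
    {"title": "Paper A", "year": 2017, "authors": ["Li Wei", "Zhang Min"]},
    {"title": "Paper B", "year": 2018, "authors": ["Li Wei"]},
    {"title": "Paper C", "year": 2018, "authors": ["Chen Jie", "Li Wei"]},
    {"title": "Paper D", "year": 2019, "authors": ["Zhang Min"]},
]


def publication_trend(records):
    """Count papers per year and return (year, count) pairs in year order."""
    counts = Counter(record["year"] for record in records)
    return sorted(counts.items())


def top_authors(records, n=3):
    """Return the n most frequent authors as (name, paper_count) pairs."""
    counts = Counter(a for record in records for a in record["authors"])
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def to_chart_options(trend, series_name):
    """Shape a trend into a categories-plus-series dictionary for charting."""
    return {
        "xAxis": {"categories": [str(year) for year, _ in trend]},
        "series": [{"name": series_name, "data": [n for _, n in trend]}],
    }


trend = publication_trend(papers)
options = to_chart_options(trend, "Papers per year")
```

In a Django view, a dictionary like `options` could be serialized to JSON and passed to the front end; that wiring is omitted here because the source does not describe it.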
Authors
WANG Shu-mei (王树梅)
SHANG Yan-liang (尚衍亮)
School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 222111, China
Source
Computer Technology and Development (《计算机技术与发展》)
2020, No. 5, pp. 165-169 (5 pages)
Funding
National Natural Science Foundation of China (61673196).
Keywords
paper crawling
multidimensional analysis
data mining
information collection
crawler automation