International Journal of Advanced Network, Monitoring and Controls, Volume 05, No. 02, 2020. DOI: 10.21307/ijanmc-2020-018

Application Research of Crawler and Data Analysis Based on Python

Wu Hejing, Liu Fang, Zhao Long, Shao Yabin, Cui Ran
East University of Heilongjiang, Heilongjiang, 150086
E-mail: 499917928@qq.com

Abstract—Combined with a practical project, this paper explores how to develop a crawler, based on a specific framework, for the complete interface of Steam publishers and their stores. The crawler should automatically and efficiently crawl the data of specific targets, parse dynamic pages, and complete data cleaning, downloading, saving and other operations. The paper then explores general data analysis methods, analyzes the downloaded data to extract useful information from it, and summarizes the specific crawler method and data analysis method through practical application.

Keywords—Python; Scrapy; Selenium; BeautifulSoup

I. INTRODUCTION

The 21st century is a book written in information. With the rapid development of information technology, today's society has become a huge aggregate of information, and this aggregate contains all kinds of data. Data is one embodiment of information. In this era of information explosion, how to efficiently find the data we want among miscellaneous data, and extract it from the network in batches, has become a key problem. However, raw, unprocessed data can itself be confusing. How to process huge and complex data with suitable technical means, so that it finally becomes an intuitive number or trend that people can grasp directly, is another important topic of this data age.

II. STATISTICAL INVESTIGATION OF PREFERENCE AND SALES VOLUME

In this project, the American Steam online game platform store is selected as the research object of the crawler. By setting a specific game company as a search keyword in Steam's online store, the data of all of that company's works in the Steam store are crawled. Useful information is then extracted by analyzing the basic data of each publisher: its preferred game genres, series sales volume, and review ratings. In addition, the game publishers are comprehensively scored and evaluated.

III. RELEVANT TECHNOLOGY AND FRAMEWORK

This project uses the Scrapy framework, based on the Python language, to crawl the Steam website. As a language, Python has the advantages of being lightweight, simple, and widely applicable. Various crawler frameworks and application libraries based on Python are by now very mature, and among them the Scrapy framework is very popular for general web crawling. Its first version was released in 2008, and it has since matured considerably as a crawler framework. The basic principle of the Scrapy framework is shown in Figure 1.

Figure 1. Basic principles of the Scrapy framework

IV. DESIGN OF THE CRAWLER

A. General design idea

A crawler is essentially a program that simulates a user's operations in a browser. First of all, the starting point and range of crawling need to be specified; a minimal spider skeleton illustrating this is sketched below.
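As an illustration only (the project's actual spider is developed in Section IV.C), a minimal Scrapy skeleton fixing the starting point and range might look like the following. The publisher URL and the blanket link-following rule are placeholders for this sketch, not the paper's real code:

    import scrapy

    class SteamSpider(scrapy.Spider):
        name = "steam"
        # range: restrict the crawl to the Steam store domain
        allowed_domains = ["store.steampowered.com"]
        # starting point: a publisher page (hypothetical URL)
        start_urls = ["https://store.steampowered.com/publisher/ParadoxInteractive"]

        def parse(self, response):
            # follow every link found on the page; allowed_domains
            # keeps the crawl from leaving the store
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Because Steam renders parts of the publisher page dynamically, the project replaces this default starting mechanism with a Selenium-driven start_requests method, described below.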
Since the target of crawling is publishers and their works, the publisher page is taken as the starting point. Take the page of the publisher Paradox Interactive as an example: analyzing the whole page shows that the page links and information for all of the publisher's games and game-related DLC are stored in the recommendation div of each entry under the recommendation rows, as shown in Figure 2.

Figure 2. Investigating the HTML structure of a Steam publisher page with the browser's inspector

B. Design and implementation of crawler functions

The crawler consists of items, spiders, pipelines and middlewares. Items define the fields to be crawled; spiders define the whole crawling process and the means of crawling; pipelines are responsible for basic operations such as data cleaning and saving; middlewares bridge Scrapy to other plug-ins or frameworks.

First, the items to be crawled are defined in the items file; these items are later handed to the analysis stage for data analysis. The specific design and implementation code is:

    import scrapy

    class SteamDevItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        qry_nam = scrapy.Field()          # search keyword (query name)
        if_dev = scrapy.Field()           # whether the company is the developer
        pub_sum = scrapy.Field()          # total works published
        pub_gam_sum = scrapy.Field()      # games published
        pub_dlc_sum = scrapy.Field()      # DLC published
        dev_nam = scrapy.Field()          # developer name
        pub_nam = scrapy.Field()          # publisher name
        gam_title = scrapy.Field()        # game title
        res_date = scrapy.Field()         # release date
        gam_type = scrapy.Field()         # game type
        gam_tag = scrapy.Field()          # game tags
        if_muti = scrapy.Field()          # whether multiplayer
        gam_score = scrapy.Field()        # review score
        gam_score_sum = scrapy.Field()    # number of reviews
        gam_score_ratio = scrapy.Field()  # ratio of positive reviews

C. Spider design

The design of the spider is the key point of this project: both the initial dynamic page connection and the final crawling of static page information are defined in this file. In this project the spider is named steam. Some key implementation code is given here, with running results and notes attached. First, the dynamic page crawling with Selenium in the start_requests method:

    # imports needed by this fragment
    from selenium import webdriver
    from bs4 import BeautifulSoup

    chrome_opt = webdriver.ChromeOptions()
    # disable image and stylesheet loading to speed up rendering
    prefs = {
        "profile.managed_default_content_settings.images": 2,
        "permissions.default.stylesheet": 2,
    }
    chrome_opt.add_experimental_option("prefs", prefs)
    browser = webdriver.Chrome(options=chrome_opt)
    # Qry_sta and Qry_Target hold the query type and the search keyword
    browser.get("https://store.steampowered.com/" + Qry_sta + "/" + Qry_Target)
    # parse the rendered page source with BeautifulSoup
    bs = BeautifulSoup(browser.page_source, 'html.parser')

The store link of each product is held in the <a> anchor tag of its entry, and these links are read into the defined links_list in a loop. Sometimes the text and the picture of an entry each carry their own <a> tag pointing at the same page, so crawling them directly would produce duplicates; an "if ... not in" check inside the loop therefore de-duplicates the list (a sketch follows Figure 3). After using print statements to verify the function of the module, the results are as shown in Figure 3.

Figure 3. List of URLs obtained by Selenium and BeautifulSoup
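The paper does not reproduce the loop itself. Continuing from the bs object above, a minimal sketch of the link collection and de-duplication just described, assuming store links can be recognized by their /app/ path segment (an assumption about the markup), might be:

    links_list = []
    for a_tag in bs.find_all('a', href=True):
        link = a_tag['href']
        # text and thumbnail anchors point at the same store page,
        # so keep each link only once
        if '/app/' in link and link not in links_list:
            links_list.append(link)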
D. Starting the targeted crawl

After designing and debugging the spider, open the system's CMD window, change to the root directory of the crawler project, and run the command

    scrapy crawl steam -o SteamDev.csv

to crawl the target website. The "-o SteamDev.csv" option tells the crawler to save the crawled data as a CSV table; the saved file appears in the project root. The crawling process is shown in Figure 4.

Figure 4. Executing the start_requests method: Selenium opens a browser to crawl the dynamic page

V. DATA ANALYSIS

Next, basic visualization is performed on the crawled data using spreadsheet operations. In the crawler project, data was crawled for the publisher Paradox Interactive and is presented as a CSV table, as shown in Figure 5. Using spreadsheets to further collate the crawled data yields the following: the publisher has released 396 works on the Steam platform, the majority of them DLC (334); most of the games released are single-player; and each game in its store has on average about 6,800 reviews, of which about 76.48% are positive. See the charts below for the detailed visual analysis; a Python equivalent of this collation is sketched after Figure 6.

Figure 5. List of crawled data

Figure 6. Ranking chart of publisher followers on the platform
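The same collation could be done in Python instead of a spreadsheet. A minimal sketch, assuming the field names from items.py and a textual gam_type column that marks DLC entries (both assumptions about the data layout):

    import pandas as pd

    # load the table produced by "scrapy crawl steam -o SteamDev.csv"
    df = pd.read_csv("SteamDev.csv")

    total_works = len(df)                            # 396 in the sample
    dlc_count = (df["gam_type"] == "DLC").sum()      # 334 in the sample
    avg_reviews = df["gam_score_sum"].mean()         # about 6,800
    avg_positive = df["gam_score_ratio"].mean()      # about 76.48%

    print(total_works, dlc_count, avg_reviews, avg_positive)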
VI. CONCLUSION

Through demonstration and partial practice, this paper explores the process of crawling dynamic pages and performing basic data analysis by combining the general Python Scrapy framework with Selenium + BeautifulSoup, taking the Steam online game store as the crawling target. The crawler extends well: for example, to compare the crawled data of several game publishers, one can first write a publisher query list and obtain the product URL list from the dynamic page of each publisher. As for countering anti-crawler measures, Selenium by itself already evades many of them; to go further, one can rotate multiple cookies and even build a proxy IP pool.

ACKNOWLEDGMENT

This paper is supported by the 2019 scientific research project of East University of Heilongjiang, "Implementation of a Crawler Based on the Python Scrapy Framework", project number HDFKY190109.

REFERENCES

[1] Yuhao Fan. Design and implementation of a distributed crawler system based on Scrapy [J]. IOP Conference Series: Earth and Environmental Science, 2018, 108(4): 2-8.
[2] Jing Wang, Yuchun Guo. Scrapy-based crawling and user-behavior characteristics analysis on Taobao [C]. 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2012: 1-5.
[3] Ryan Mitchell. Python Web Crawler Authoritative Guide (Second Edition) [M]. Beijing: People's Posts and Telecommunications Press, 2019: 57-70.
[4] Wei Chengcheng. Data information crawler technology based on Python [J]. Electronic World, 2018(11): 208-209.
[5] Mark Lutz. Learning Python (Fifth Edition, Volume I) [M]. Beijing: China Machine Press, 2019: 1-2.
[6] Fan Chuanhui. Python crawler development and project practice [M]. Beijing: China Machine Press, 2017: 69-72.
[7] Song Yongsheng, Huang Rongmei, Wang Jun. Research on a Python-based data analysis and visualization platform [J]. Modern Information Technology, 2019(21): 1-4.
[8] Liu Yuke, Wang Ping. Statistics and graph output of student achievement data based on Python + pandas + Matplotlib [J]. Fujian Computer, 2017(11): 2-6.
[9] Long Hu, Yang Hui. Data analysis and visualization in the context of big data [J]. Journal of Kaili University, 2016(03): 1-3.