Measuring The Web Crawler Ethics

C. Lee Giles
College of Information Sciences and Technology
Pennsylvania State University
University Park, PA, USA
giles@ist.psu.edu

Yang Sun
AOL Research
888 Villa St
Mountain View, CA, USA
yang.sun@corp.aol.com

Isaac G. Councill
Google Inc.
76 Ninth Avenue 4th Floor
New York, NY, USA
icouncill@gmail.com

ABSTRACT
Web crawlers are highly automated and seldom regulated
manually. The diversity of crawler activities often leads to
ethical problems such as spam and service attacks. In this
research, quantitative models are proposed to measure web
crawler ethics based on crawler behavior on web servers.
We investigate and define rules to measure crawler ethics,
referring to the extent to which web crawlers respect the
regulations set forth in robots.txt configuration files. We
propose a vector space model to represent crawler behavior
and measure the ethics of web crawlers based on the
behavior vectors. The results show that ethicality scores
vary significantly among crawlers. Most commercial web
crawlers behave ethically. However, many commercial
crawlers still consistently violate or misinterpret certain
robots.txt rules. We also measure the ethics of big search
engine crawlers in terms of return on investment. The
results show that Google has a higher score than other search
engines for a US website but has a lower score than Baidu
for Chinese websites.

Categories and Subject Descriptors
K.4.1 [Public Policy Issues]: Ethics; K.4.1 [Public Policy Issues]: Privacy

General Terms
Measurement, Design, Experimentation, Algorithms

Keywords
robots.txt, web crawler ethics, ethicality, privacy

1. INTRODUCTION
Web crawlers are widely used by search engines and many
other web applications to collect content from the Web.
These crawlers are highly automated and seldom regulated
manually. As a fast-growing number of online services rely
on web crawlers to collect web pages, the functionalities and
activities of web crawlers have become extremely diverse.
Crawler activities typically include requesting web pages for
general-purpose text indexing and searching, and extracting
email addresses and personal identity information for
business purposes as well as for malicious purposes.
Accessing web information with automated web crawlers can
lead to ethical problems of privacy and security. For example,
crawlers can extract personal contact information for spam
purposes and identity theft. Crawlers may also overload
a website such that normal user access is impeded. Web
crawler activities can be regulated from the server side by
deploying the Robots Exclusion Protocol (a set of rules in a
file called robots.txt) in the root directory of a website,
allowing webmasters to indicate to visiting crawlers which
parts of their sites should not be visited as well as a minimum
time interval between visits. A recent study shows that more
than 30% of active websites employ this standard to regulate
crawler activities [2, 3]. However, since the Robots
Exclusion Protocol (REP) serves only as an unenforced
advisory to crawlers, web crawlers may ignore the rules and
access part of the forbidden information on a website.
Violating the robots.txt rules can lead to serious privacy and
security concerns. Thus, measuring crawler ethics becomes
an important task that helps detect improper crawler
behavior at an early stage and identify unethical crawlers.
The issues of crawler ethics, however, have not received
enough attention from the research community and are
understudied. Crawler ethics are not limited to whether
crawlers obey website rules; they can also be studied in terms
of the value crawlers provide to websites. A crawler that
provides zero value to the crawled website should also be
considered less ethical than those that provide positive value.

Copyright is held by the author/owner(s).
WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA.
ACM 978-1-60558-799-8/10/04.
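As a concrete illustration of the two kinds of robots.txt rules mentioned above (forbidden paths and a minimum visit interval), the following minimal sketch parses a small robots.txt file with Python's standard `urllib.robotparser` module. The file content, the user-agent name `examplebot`, and the `example.org` URLs are hypothetical and are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A compliant crawler would check each URL before requesting it.
allowed_public = rules.can_fetch("examplebot", "http://example.org/index.html")
allowed_private = rules.can_fetch("examplebot", "http://example.org/private/data.html")

# The requested minimum interval (in seconds) between visits, if any.
delay = rules.crawl_delay("examplebot")
```

Because the protocol is advisory, nothing stops a crawler from skipping these checks, which is why the behavior has to be measured from server logs.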

In this research, we propose a vector space model for
measuring web crawler ethics based on the Robots Exclusion
Protocol. We define the ethicality metric to measure web
crawler ethics. We also study the ethics of big search engine
crawlers in terms of return on investment, where crawler
visits are considered investments from websites and the
corresponding search engine traffic is considered the return.
The results show that Google has a much higher score for US
websites but has a lower score than Baidu for Chinese websites.

2. RELATED WORK
The ethical factors of web crawling are examined in [4] from
three perspectives: denial of service, cost, and privacy. An
ethical crawl guideline is described for crawler owners to
follow. This guideline suggests taking legal action or
initiating a professional organization to regulate web
crawlers. Our research adopts these perspectives of crawler
ethics and expands them into a computational measure. The
ethical issues of administering web crawlers are discussed
in [1], which provides a guideline for ethical crawlers to
follow. The guideline also gives great insight into our
research on ethics measurement. However, none of the
above-mentioned work provides a quantitative measure of web
crawler ethics.

WWW 2010 • Poster April 26-30 • Raleigh • NC • USA

3. CRAWLER BEHAVIOR MODEL
In our research, each web crawler’s behavior is modeled
as a vector in the rule space, where the rules are those
specified by the Robots Exclusion Protocol to regulate
crawler behavior. If a crawler violates a rule, the
corresponding vector element is larger than 0. Websites can
also be modeled in the rule space: if a website includes a rule
in its robots.txt file, the corresponding vector element is
larger than 0. The actual value of a rule element can be
defined based on the consequences or cost of violating that rule.
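The rule-space representation can be sketched as follows. This is only an illustration under stated assumptions: the three rules in `RULE_SPACE` and the cost weights in `RULE_COST` are hypothetical, and the paper does not prescribe how element values are chosen beyond their being larger than 0.

```python
from typing import Dict, List

# Hypothetical rule space: one dimension per robots.txt rule.
# Both the rules and their cost weights are illustrative assumptions.
RULE_SPACE: List[str] = [
    "Disallow: /private/",
    "Disallow: /tmp/",
    "Crawl-delay: 10",
]
RULE_COST: Dict[str, float] = {
    "Disallow: /private/": 2.0,
    "Disallow: /tmp/": 1.0,
    "Crawl-delay: 10": 1.0,
}


def website_vector(site_rules: List[str]) -> List[float]:
    """Element is > 0 when the site's robots.txt includes the rule."""
    present = set(site_rules)
    return [RULE_COST[rule] if rule in present else 0.0 for rule in RULE_SPACE]


def crawler_vector(violated_rules: List[str]) -> List[float]:
    """Element is > 0 when the crawler violated the rule."""
    violated = set(violated_rules)
    return [RULE_COST[rule] if rule in violated else 0.0 for rule in RULE_SPACE]


# A site that declares two of the rules, and a crawler that broke one of them.
site = website_vector(["Disallow: /private/", "Crawl-delay: 10"])
bot = crawler_vector(["Disallow: /private/"])
```

Keeping both vectors in the same rule space makes it straightforward to compare what a site asked for with what a crawler actually did.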

We define content ethicality Ec and access ethicality Ea
scores to evaluate web crawler ethics. In content ethicality,
cost is defined as the number of restricted web pages or web
directories being unethically accessed (see Eq. 1).

$$E_c(C) = \sum_{w_i \in W} \frac{\|V_C(w_i)\|}{\|D(w_i)\|} \tag{1}$$
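A minimal sketch of Eq. 1 is given below. It assumes that V_C(w_i) is the set of restricted paths on website w_i that crawler C actually requested and that D(w_i) is the set of paths restricted by that site's robots.txt; this reading is an interpretation of the text, and the function name, site names, and paths are hypothetical.

```python
from typing import Dict, Set


def content_ethicality(
    violated: Dict[str, Set[str]],
    disallowed: Dict[str, Set[str]],
) -> float:
    """Sum, over websites, of the fraction of restricted paths accessed."""
    score = 0.0
    for site, restricted in disallowed.items():
        if not restricted:
            continue  # a site with no restricted paths cannot be violated
        # Only count accesses that fall inside the restricted set.
        hits = violated.get(site, set()) & restricted
        score += len(hits) / len(restricted)
    return score


# Hypothetical access data for one crawler on two websites.
disallowed = {
    "site-a": {"/private/", "/tmp/", "/cgi-bin/", "/admin/"},
    "site-b": {"/search/", "/login/"},
}
violated = {
    "site-a": {"/tmp/"},
    "site-b": {"/search/", "/login/"},
}
score = content_ethicality(violated, disallowed)
```

In this example the crawler touched one of four restricted paths on the first site and both restricted paths on the second, so the two terms are 0.25 and 1.0.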

Access ethicality is defined as how well a crawler respects the
desired visit interval (crawl-delay rule in the robots.txt file) of
the website (see Eq. 2).

$$E_a(C) = \sum_{w_i \in W} \frac{e^{-(\mathrm{interval}_C(w_i) - \mathrm{delay}(w_i))}}{1 + e^{-(\mathrm{interval}_C(w_i) - \mathrm{delay}(w_i))}} \tag{2}$$
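Each term of Eq. 2 is a logistic function of the gap between the crawler's observed visit interval and the requested crawl delay: it equals 0.5 when the two match, approaches 0 when the crawler waits much longer than requested, and approaches 1 when it visits much faster. The sketch below computes the score under the assumption that both quantities are given in the same time unit; the function names and the branch used to avoid floating-point overflow are implementation choices, not part of the paper.

```python
import math
from typing import Dict


def access_term(interval: float, delay: float) -> float:
    """One summand of Eq. 2: e^(-x) / (1 + e^(-x)) with x = interval - delay."""
    x = interval - delay
    if x >= 0:
        z = math.exp(-x)  # z is in (0, 1], so no overflow
        return z / (1.0 + z)
    # Algebraically identical form that avoids overflow for very negative x.
    return 1.0 / (1.0 + math.exp(x))


def access_ethicality(
    intervals: Dict[str, float],
    delays: Dict[str, float],
) -> float:
    """Sum the per-site terms over sites that specify a crawl delay."""
    return sum(
        access_term(intervals[site], delays[site])
        for site in delays
        if site in intervals
    )


# Hypothetical observations: the crawler matches the requested delay exactly.
score = access_ethicality({"a": 10.0, "b": 5.0}, {"a": 10.0, "b": 5.0})
```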

A major advantage for websites allowing search engine
crawlers to crawl their web pages is that the search engines
bring traffic back to them. From this perspective, being
ethical for a web crawler means bringing more visits back
to the crawled websites. The effective ethicality of search
engine S to a website can be defined as the ratio between
the user visits referred by the search engine to the website
and visits generated by the crawler r of the search engine to
the website (see Eq. 3).

$$E_{\mathrm{effective}}(r) = \frac{\mathrm{Referenced}(S)}{\mathrm{Crawled}(r)} \tag{3}$$
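Eq. 3 is a simple ratio, sketched below with the visit counts that Table 3 reports for Site 1. The function name and the handling of a zero crawl count are assumptions made for this illustration.

```python
def effective_ethicality(referenced: int, crawled: int) -> float:
    """Ratio of user visits referred by a search engine to its crawler visits."""
    if crawled <= 0:
        # The ratio is undefined when the crawler never visited the site.
        raise ValueError("crawled must be positive")
    return referenced / crawled


# Counts for Site 1 as reported in Table 3.
google_site1 = effective_ethicality(260898, 16799253)
baidu_site1 = effective_ethicality(37, 27)
```

The Baidu figure for Site 1 rests on only 27 crawler visits, which is worth keeping in mind when comparing such ratios across sites.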

4. EXPERIMENTS
Rank  User-agent             Content Ethicality
1     hyperestraier/1.4.9    0.95621
2     Teemer                 0.01942
3     msnbot-media/1.0       0.00632
4     Yahoo! Slurp           0.00417
5     charlotte/1.0b         0.00394
6     gigabot/3.0            0.00370
7     nutch test/nutch-0.9   0.00316
8     googlebot-image/1.0    0.00315
9     Ask Jeeves/Teoma       0.00302
10    googlebot/2.1          0.00282

Table 1: Content ethicality scores for crawlers that
visited our test site.

Tables 1 and 2 list the content and access ethicality results
for the top crawlers that visited our test website during the
time of the study. A higher ethicality score indicates less
ethical behavior.

Rank  User-agent             Access Ethicality
1     msnbot-media/1.0       0.3317
2     hyperestraier/1.4.9    0.3278
3     Yahoo! Slurp/3.0       0.2949
4     Teemer                 0.2744
5     Arietis/Nutch-0.9      0.0984
6     msnbot/1.0             0.098
7     disco/Nutch-1.0-dev    0.0776
8     ia_archiver            0.077
9     gigabot/3.0            0.0079
10    googlebot/2.1          0.0075

Table 2: Access ethicality scores for crawlers that visited
our test site.

The effective ethicality scores of Google, Yahoo, MSN and Baidu
are shown in Table 3. The data was collected between 2008/05/13
and 2008/06/21. Site 1 is CiteSeerX, a large-scale academic
digital library for computer science. Site 2 is a Chinese
movie information website. Site 3 is guopi.com, an online
makeup retail store.

5. CONCLUSIONS
We formally defined three ethicality scores to measure web
crawler ethics. The results show that most commercial crawlers
receive good ethicality scores. However, it is surprising
to see commercial crawlers consistently disobeying or
misinterpreting some robots.txt rules. The crawling algorithms
and policies that lead to such behavior are unknown. However,
obtaining more content is an obvious reason for most
crawlers failing to obey certain rules.

Engine  Website  Crawled    Referenced  Ereturn
google  Site 1   16799253   260898      0.01553
google  Site 2   872001     46469       0.05329
google  Site 3   368417     145115      0.39389
yahoo   Site 1   17375962   3919        0.00023
yahoo   Site 2   502584     1249        0.00249
yahoo   Site 3   315119     11819       0.03751
msn     Site 1   677181     362         0.00054
msn     Site 2   16330      5448        0.33362
msn     Site 3   51128      3801        0.07434
baidu   Site 1   27         37          1.37037
baidu   Site 2   622667     61964       0.09951
baidu   Site 3   1830847    844786      0.46142

Table 3: Comparison of the effectiveness of Google,
Yahoo, MSN and Baidu.

The effective ethicality scores of search engines vary
significantly across websites. Ranked by referenced
visits, Google plays a dominant role on the US-based Site 1
and ranks 2nd and 3rd on the two China-based websites.
Baidu leads the search market in China.

6. REFERENCES
[1] D. Eichmann. Ethical web agents. Computer Networks
and ISDN Systems, 28(1-2):127–136, 1995.

[2] S. Kolay, P. D’Alberto, A. Dasdan, and
A. Bhattacharjee. A larger scale study of robots.txt. In
WWW ’08: Proceeding of the 17th international
conference on World Wide Web, pages 1171–1172, New
York, NY, USA, 2008. ACM.

[3] Y. Sun, Z. Zhuang, and C. L. Giles. A large-scale study
of robots.txt. In WWW ’07, 2007.

[4] M. Thelwall and D. Stuart. Web crawling ethics
revisited: Cost, privacy, and denial of service. J. Am.
Soc. Inf. Sci. Technol., 57(13):1771–1779, November
2006.
