Robots.txt meant for search engines don't work well for web archives

Posted on April 17, 2017 by Mark Graham

Robots.txt files were invented 20+ years ago to help advise "robots," mostly search engine web crawlers, which sections of a web site should be crawled and indexed for search.

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or "hiding" sensitive areas of the site like administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!) Some crawlers, like Google's, pay attention to robots.txt directives, while others do not.

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. The Internet Archive's goal is to create complete "snapshots" of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge in the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business, the parked domain is "blocked" from search engines, and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints about these "disappeared" sites almost daily.

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved toward broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user's perspective.
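[Editor's note: to make those directives concrete, here is a minimal sketch of how a typical SEO-oriented robots.txt is evaluated, using Python's standard urllib.robotparser. The site, paths, and agent names are illustrative only.]

    from urllib import robotparser

    # A typical SEO-oriented robots.txt: hide print duplicates, search
    # results, and admin pages from everyone, and keep one named crawler
    # out entirely.
    rules = """\
    User-agent: ia_archiver
    Disallow: /

    User-agent: *
    Disallow: /recipes/print/
    Disallow: /search
    Disallow: /admin/
    """

    rp = robotparser.RobotFileParser()
    rp.parse([line.strip() for line in rules.splitlines()])

    # A generic crawler may fetch articles but not the print duplicates...
    print(rp.can_fetch("ExampleBot", "http://example.com/recipes/cake"))        # True
    print(rp.can_fetch("ExampleBot", "http://example.com/recipes/print/cake"))  # False
    # ...while the named agent is blocked from the whole site.
    print(rp.can_fetch("ia_archiver", "http://example.com/recipes/cake"))       # False

Because each User-agent section carries its own rules, different crawlers can see entirely different policies from the same file, which is why a robots.txt tuned for search engines can shut an archival crawler out completely.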
Posted in Announcements, News

34 thoughts on "Robots.txt meant for search engines don't work well for web archives"

Daniel, April 17, 2017 at 2:18 pm:
So the plan is to no longer respect robots.txt files with directives that explicitly say User-Agent: ia_archiver, or User-Agent: *? Many websites explicitly block the Internet Archive's ia_archiver crawler while allowing other crawlers. Have you considered adopting AppleNewsBot's policy of pretending to be Googlebot? Robots directives written for Googlebot are more permissive than rules for other crawlers. Also, many sites block everything but Googlebot. I'm kind of torn on whether improving the archive by ignoring the wishes of webmasters is a good thing or not. I often run into issues with pages missing from the archive, only to discover that the website has specifically excluded ia_archiver. However, I still believe it's important to preserve a standardized mechanism for controlling crawlers and bots of all kinds.

Pingback: News Roundup | LJ INFOdocket

Joshua, April 17, 2017 at 11:31 pm:
A better choice would probably have been to respect robots.txt as of the time you crawled it; that is, once a page is archived, changing robots.txt later doesn't change its visibility. Oh well.

MeditateOrDie, April 18, 2017 at 7:02 am:
I mostly agree with this, though some sites seem to block archive.org for no sensible reason; then when their site eventually dies, as all sites do sooner or later, all of that good, useful info is lost forever. Perhaps giving more advanced users a means of overriding robots.txt on a per-save and per-read basis, while still keeping the default behaviors active for general use, might be a compromise that satisfies the needs of dutiful archivers, general users, and webmasters. Something like a URL modification could be used to do the trick (this used to work in the past until this handy undocumented functionality was removed), e.g. adding varying numbers of "." in the right parts of a URL used to work nicely for saves and reads, until fairly recently. There's not much point in us 'saving the web' if we human beings cannot access the archives because of our robot[s.txt] overlords! https://web.archive.org/web/https://u.cubeupload.com/ZkJ7hq.gif 🙂

Shannon, April 18, 2017 at 4:22 pm:
Yay! Sites that have retroactively changed their robots.txt have caused me a *lot* of problems for (recent) historical research. The info is often still out there, but it can take additional hours to find it. And sometimes it's just gone instead. It never made any sense to block access to the archives today when the sites had allowed your crawling yesterday.

Adam, April 19, 2017 at 12:55 pm:
There is another problem with robots.txt that everyone here should know. There are two ways that an archive could accidentally be removed:
- If a website goes dead and a different owner opens a new website at the same URL as the dead one, with a blocking robots.txt, the entire archive, including the dead website, will be removed, even though the new owner has no claim on the previous website.
- If a site gets hacked to include a blocking robots.txt, the same thing happens as above.
The Wayback Machine should change its system so that the robots.txt check does not remove already-archived pages, and only disables further archiving of a page after the robots.txt is established. Removing all history would require talking to archive.org to "flag" the site so that users cannot archive it. There should also be a protection that denies web owners who try to remove history that isn't theirs (they should be required to show proof of ownership) from deleting a dead site. Please let me know by sending an email to Adomsyik@gmail.com.
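[Editor's note: Joshua's suggestion above, Adam's proposal here, and Darren Duncan's comment further down all describe the same mechanism: evaluate the robots.txt that was captured alongside the page, not the one on the live site today. A minimal sketch of that idea, assuming a per-capture robots.txt snapshot is available; the function and its arguments are illustrative, not Wayback Machine internals.]

    from urllib import robotparser

    def visible_in_archive(capture_url, robots_txt_at_capture, agent="archivebot"):
        # Decide playback visibility from the robots.txt stored with the
        # capture, so a later robots.txt (e.g. from a domain squatter or a
        # hacked site) cannot retroactively hide history.
        rp = robotparser.RobotFileParser()
        rp.parse(robots_txt_at_capture.splitlines())
        return rp.can_fetch(agent, capture_url)

    # A page that was crawlable when captured stays visible, regardless of
    # what the domain's robots.txt says today.
    robots_2010 = "User-agent: *\nDisallow: /private/\n"
    print(visible_in_archive("http://example.com/blog/post", robots_2010))     # True
    print(visible_in_archive("http://example.com/private/page", robots_2010))  # False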
MKKLSDKLS, April 19, 2017 at 9:16 pm:
As someone who has seen many good websites be "park-nuked" and kicked out of the publicly accessible Archive, I beg you to ignore a parked website's robots.txt wishes. If we really want to archive the web, people who have literally no relation to the previous website beyond usurping its name via domain-squatting should have no say in what is archived.

Andy L, April 20, 2017 at 6:56 pm:
I'm happy to hear about this change. I've always thought it was a shame that changes to today's robots.txt files are able to reach back in time and scrub a site from existence. I guess I understand why that policy was put into place, but it doesn't seem to make sense long term. For any given domain, the current website and owner might be completely unrelated to the historic pages that were there before.

posty, April 21, 2017 at 8:20 am:
Could we expand this to more than just US government websites? Australian government websites do this too, e.g. http://operational.humanservices.gov.au/robots.txt. That website clearly details how our government's social security system works; it changes over time, which leaves the public at a disadvantage.

Mark Graham (post author), April 25, 2017 at 6:48 pm:
Yes, in general terms we think information produced by governments around the world, and published via public websites, should be preserved and made available via the Wayback Machine.

Andre Borie, April 21, 2017 at 10:20 am:
I really don't see any problem with this. If a human can access it, the archive should be able to as well; anyone who doesn't want their stuff searchable or archived online should just put a password on it. The only good thing about robots.txt is the rate limiting, so smaller sites can limit the bandwidth allocated to crawling if they wish. By the way, what does this mean for previously archived sites that have since changed their robots.txt to block the Archive? Do you still keep the original data, and if so, would you be able to restore access to it? I've seen a few sites that used to be accessible on the Wayback Machine but are not anymore due to a recent robots.txt change, and I'd love to see them available again if the original data wasn't deleted.

Jim Moores, April 21, 2017 at 11:40 am:
I've found that I can't access archive material that I myself created, because I let a domain I was no longer using expire and it now has a non-permissive robots.txt. At a minimum archive.org needs to respect the robots.txt only at the point of collection, but my personal opinion is that it should be ignored completely by archive.org, with people able to actively opt out in some other way.

Mo, April 21, 2017 at 2:35 pm:
In my case I was trying to retrieve an old web site of mine. A cybersquatter later bought the domain and put up a robots.txt. Now I can't see my own site. The archive respects a new robots.txt file owned by a squatter who is effectively blocking a historical archive they had NOTHING to do with. That is INSANE.

Mark Graham (post author), April 25, 2017 at 6:46 pm:
Thank you Mo. People write to us about the situation you describe every day. In many cases they implore us to make their content available again. This is exactly the harm we wish to address here. And, everyone, please remember you can always write to info@archive.org if you would like us to not crawl your site.

Ryan, April 21, 2017 at 10:06 pm:
Just this morning ia_archiver submitted a form on my site (the form was blank, but the point is that it clicked submit). Any crawler that submits forms is a jerk crawler. Would you consider redesigning your crawler to be less offensive?

Mark Graham (post author), April 25, 2017 at 6:42 pm:
The "ia_archiver" User Agent is used by Alexa Internet, not the Internet Archive.

Henrik, April 23, 2017 at 9:54 am:
On tools.ietf.org, all the information is public. I use robots.txt primarily to steer web crawlers away from pages which require substantial CPU resources to generate. Background: tools.ietf.org has been a pro-bono activity for 15 years, and runs on donated hardware; I don't have the means to upgrade to a level of CPU resources that could serve generated pages at the rate the searchbots can hit them. The pages I steer robots away from are, for instance, source repository diffs, logs, commits, etc., served through Trac. If a crawler is sufficiently gentle, and is able to back off its crawl rate when pages become slow to serve, I'm perfectly happy to have all of the pages now denied by my robots.txt crawled.
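[Editor's note: what Henrik asks for, like Andre Borie's point about rate limiting above, is an adaptive politeness policy: slow the crawl down when the server slows down. A minimal sketch of such a feedback loop, with illustrative thresholds; this is not any real crawler's implementation.]

    import time
    import urllib.request

    def polite_fetch(urls, min_delay=1.0, max_delay=60.0):
        # Fetch URLs one at a time, backing off when responses get slow
        # and easing back toward the minimum delay when they are fast.
        delay = min_delay
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            elapsed = time.monotonic() - start
            if elapsed > 2.0:                      # server is struggling: double the pause
                delay = min(delay * 2, max_delay)
            else:                                  # server is healthy: crawl a bit faster
                delay = max(delay / 2, min_delay)
            yield url, body
            time.sleep(delay)

    # Usage (against your own test server):
    # for url, page in polite_fetch(["http://localhost:8000/a", "http://localhost:8000/b"]):
    #     print(url, len(page))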
Chris, April 23, 2017 at 6:38 pm:
I'd implore you to consider recognizing an "archive.txt"-like standard, then. For people like myself who maintain a personal website, I tend to use it as a file server and would be quite annoyed if my resume (which contains an email address and contact phone number) ended up archived. The alternative would be to remove everything I don't want archived. I don't think that's your intended goal, so please rethink this strategy.

Mark Graham (post author), April 25, 2017 at 6:39 pm:
Thank you for this Chris. Please do write to us at info@archive.org about any sites you manage. I promise we will be responsive.

Ross, April 23, 2017 at 8:13 pm:
Internet Archive, thank you for wanting to archive the web as users see it, which is the whole point of "Saving the Web"! I had respect for robots.txt 20 years ago, but today it's clear that we cannot allow site owners to affect the public record through their own selfish choices. Stay the course, thanks again!

vinz, April 24, 2017 at 2:20 am:
Alas, this comes too late for many of my favourite sites. 2014-2015 took out a lot for some reason, as did mid-2008. I guess I'll have to hold out until computers can reconstruct things straight from memory, then do a big ol' rip. I also wish I knew why it doesn't save images properly sometimes; I run into a lot of those at self-hosted sites, unless the crawler just happened to hit while a file was broken.

Darren Duncan, April 24, 2017 at 8:50 am:
This is a good move on the part of the Internet Archive in principle. At the very least, something I remember requesting of the Internet Archive years ago is that any respect they give robots.txt should be time sensitive. If a domain's robots.txt allows archiving in the present, then the Internet Archive should always make today's version of that content available in perpetuity, even if tomorrow's robots.txt for that domain denies archiving. I would want any website I operate to be archived, and if I gave up any of my domain names in the future, I would not want the future owners of those domain names to be able to cause the Internet Archive to stop displaying the versions of the domain that existed while I controlled it.

Vix, April 24, 2017 at 9:40 am:
"Archiving relying (..) more on representing the web as it really was, and is, from a user's perspective." I agree 100%. Robots.txt doesn't limit regular users, and archiving's purpose is to reflect the users' perspective, not SEO crawling. Go for it!

Michael Martinez, April 24, 2017 at 3:29 pm:
If you ignore "robots.txt" directives, people will find other ways to block you. While it's unfortunate that you don't keep data live after a "robots.txt" change, that is your own bad policy. The Robots Exclusion "standard" is NOT a standard; it's an arbitrary and voluntary set of guidelines. No one forced the archive to take content offline after domain names changed hands. You can easily correct that bad practice by changing your policy rather than blaming the non-standard "standard" (of which MOST PEOPLE are unaware) for the issue. While you're fixing the problems with your system, you could also make it easier for webmasters who do know about both the "robots.txt" file and your archive to correct errors, rather than having to wait 24 hours or longer for your crawler to see changes.

Mark Graham (post author), April 25, 2017 at 6:37 pm:
Thank you Michael. We encourage people to write to us at info@archive.org to report bugs and make requests (including requests for content to be removed from the Wayback Machine and for sites not to be crawled). I assure you we read every message sent to us, and act on them as appropriate. Many of the features we add, and bugs we fix, are a direct result of user feedback.

Pingback: The Internet Archive and robots.txt — Pixel Envy

nascent, April 25, 2017 at 1:04 pm:
There still needs to be a way of specifically preventing IA from archiving a domain.

Mark Graham (post author), April 25, 2017 at 6:33 pm:
Please know that site owners can always write to info@archive.org and request that content from a site be removed from the Wayback Machine and excluded from future crawling. We process requests like that every day.

Pingback: Editors' Choice: Robots.txt

Adam, April 25, 2017 at 6:25 pm:
It's bad enough that robots.txt not only prevents archiving, it also hides the entire archive (in other words, if you archive a site and it later deploys a blocking robots.txt, the archive disappears). That includes a website being hacked to include a robots.txt.

John, April 26, 2017 at 3:23 pm:
Then please explain how I can keep a site ephemeral, as intended. Are there IP address ranges, HTTP headers, etc., that can be used to forbid access? What is the way to reliably keep sites out of the archive, for the time you respected robots.txt, now, and forever? My sites explicitly tell robots "NOARCHIVE". You shouldn't even have the files on your systems. Retroactively making archives public is a dick move.

Mark Graham (post author), April 26, 2017 at 3:54 pm:
Hi John. Please email your request to info@archive.org and we will promptly process it.
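[Editor's note: the "NOARCHIVE" signal John mentions is normally expressed per page, either as a robots meta tag (<meta name="robots" content="noarchive">) or as an X-Robots-Tag HTTP response header. Search engines honor it; whether any given archival crawler does is exactly what this policy change leaves open, so emailing info@archive.org remains the reliable opt-out. For illustration only, a minimal Python standard-library server that attaches the header to every response.]

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class NoArchiveHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"<html><body>Ephemeral content.</body></html>\n"
            self.send_response(200)
            # Header-level equivalent of the noarchive meta tag; unlike the
            # meta tag, it also covers non-HTML files such as PDFs.
            self.send_header("X-Robots-Tag", "noarchive, noindex")
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Serves on http://localhost:8000/ until interrupted.
        HTTPServer(("localhost", 8000), NoArchiveHandler).serve_forever()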
Chris Haines, April 26, 2017 at 3:33 pm:
"We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user's perspective." I agree. I think times have changed, and this reflects what users want from a web archive these days.

Comments are closed.