Robots.txt meant for search engines don't work well for web archives

Posted on April 17, 2017 by Mark Graham

Robots.txt files were invented 20+ years ago to help advise "robots," mostly search engine web crawlers, which sections of a web site should be crawled and indexed for search.

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or "hiding" sensitive areas of the site like administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!) Some crawlers, like Google's, pay attention to robots.txt directives, while others do not.

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. The Internet Archive's goal is to create complete "snapshots" of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge in the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business, the parked domain is "blocked" from search engines, and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints about these "disappeared" sites almost daily.

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved toward broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user's perspective.
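[Editor's note: to make those directives concrete, here is a minimal sketch of how a typical SEO-oriented robots.txt is evaluated, using Python's standard urllib.robotparser. The site, paths, and agent names are illustrative only.]

    from urllib import robotparser

    # A typical SEO-oriented robots.txt: hide print duplicates, search
    # results, and admin pages from everyone, and keep one named crawler
    # out entirely.
    rules = """\
    User-agent: ia_archiver
    Disallow: /

    User-agent: *
    Disallow: /recipes/print/
    Disallow: /search
    Disallow: /admin/
    """

    rp = robotparser.RobotFileParser()
    rp.parse([line.strip() for line in rules.splitlines()])

    # A generic crawler may fetch articles but not the print duplicates...
    print(rp.can_fetch("ExampleBot", "http://example.com/recipes/cake"))        # True
    print(rp.can_fetch("ExampleBot", "http://example.com/recipes/print/cake"))  # False
    # ...while the named agent is blocked from the whole site.
    print(rp.can_fetch("ia_archiver", "http://example.com/recipes/cake"))       # False

Because each User-agent section carries its own rules, different crawlers can see entirely different policies from the same file, which is why a robots.txt tuned for search engines can shut an archival crawler out completely.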
Posted in Announcements, News

34 thoughts on "Robots.txt meant for search engines don't work well for web archives"

Daniel, April 17, 2017 at 2:18 pm:
So the plan is to no longer respect robots.txt files with directives that explicitly say User-Agent: ia_archiver, or User-Agent: *? Many websites explicitly block the Internet Archive's ia_archiver crawler while allowing other crawlers. Have you considered adopting AppleNewsBot's policy of pretending to be Googlebot? Robots directives written for Googlebot are more permissive than rules for other crawlers. Also, many sites block everything but Googlebot. I'm kind of torn on whether improving the archive by ignoring the wishes of webmasters is a good thing or not. I often run into issues with pages missing from the archive, only to discover that the website has specifically excluded ia_archiver. However, I still believe it's important to preserve a standardized mechanism for controlling crawlers and bots of all kinds.

Pingback: News Roundup | LJ INFOdocket

Joshua, April 17, 2017 at 11:31 pm:
A better choice would probably have been to respect robots.txt as of the time you crawled it; that is, once a page is archived, changing robots.txt later doesn't change its visibility. Oh well.

MeditateOrDie, April 18, 2017 at 7:02 am:
I mostly agree with this, though some sites seem to block archive.org for no sensible reason; then when their site eventually dies, as all sites do sooner or later, all of that good, useful info is lost forever. Perhaps giving more advanced users a means of overriding robots.txt on a per-save and per-read basis, while still keeping the default behaviors active for general use, might be a compromise that satisfies the needs of dutiful archivers, general users, and webmasters. Something like a URL modification could be used to do the trick (this used to work in the past until this handy undocumented functionality was removed), e.g. adding varying numbers of "." in the right parts of a URL used to work nicely for saves and reads, until fairly recently. There's not much point in us 'saving the web' if we human beings cannot access the archives because of our robot[s.txt] overlords! https://web.archive.org/web/https://u.cubeupload.com/ZkJ7hq.gif 🙂

Shannon, April 18, 2017 at 4:22 pm:
Yay! Sites that have retroactively changed their robots.txt have caused me a *lot* of problems for (recent) historical research. The info is often still out there, but it can take additional hours to find it. And sometimes it's just gone instead. It never made any sense to block access to the archives today when the sites had allowed your crawling yesterday.

Adam, April 19, 2017 at 12:55 pm:
There is another problem with robots.txt that everyone here should know. There are two ways that an archive could accidentally be removed:
- If a website goes dead and a different owner opens a new website at the same URL as the dead one, with a blocking robots.txt, the entire archive, including the dead website, will be removed, even though the new owner has no claim on the previous website.
- If a site gets hacked to include a blocking robots.txt, the same thing happens as above.
The Wayback Machine should change its system so that the robots.txt check does not remove already-archived pages, and only disables further archiving of a page after the robots.txt is established. Removing all history would require talking to archive.org to "flag" the site so that users cannot archive it. There should also be a protection that denies web owners who try to remove history that isn't theirs (they should be required to show proof of ownership) from deleting a dead site. Please let me know by sending an email to Adomsyik@gmail.com.
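[Editor's note: Joshua's suggestion above, Adam's proposal here, and Darren Duncan's comment further down all describe the same mechanism: evaluate the robots.txt that was captured alongside the page, not the one on the live site today. A minimal sketch of that idea, assuming a per-capture robots.txt snapshot is available; the function and its arguments are illustrative, not Wayback Machine internals.]

    from urllib import robotparser

    def visible_in_archive(capture_url, robots_txt_at_capture, agent="archivebot"):
        # Decide playback visibility from the robots.txt stored with the
        # capture, so a later robots.txt (e.g. from a domain squatter or a
        # hacked site) cannot retroactively hide history.
        rp = robotparser.RobotFileParser()
        rp.parse(robots_txt_at_capture.splitlines())
        return rp.can_fetch(agent, capture_url)

    # A page that was crawlable when captured stays visible, regardless of
    # what the domain's robots.txt says today.
    robots_2010 = "User-agent: *\nDisallow: /private/\n"
    print(visible_in_archive("http://example.com/blog/post", robots_2010))     # True
    print(visible_in_archive("http://example.com/private/page", robots_2010))  # False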
MKKLSDKLS, April 19, 2017 at 9:16 pm:
As someone who has seen many good websites be "park-nuked" and kicked out of the publicly accessible Archive, I beg you to ignore a parked website's robots.txt wishes. If we really want to archive the web, people who have literally no relation to the previous website beyond usurping its name via domain-squatting should have no say in what is archived.

Andy L, April 20, 2017 at 6:56 pm:
I'm happy to hear about this change. I've always thought it was a shame that changes to today's robots.txt files are able to reach back in time and scrub a site from existence. I guess I understand why that policy was put into place, but it doesn't seem to make sense long term. For any given domain, the current website and owner might be completely unrelated to the historic pages that were there before.

posty, April 21, 2017 at 8:20 am:
Could we expand this to more than just US government websites? Australian government websites do this too, e.g. http://operational.humanservices.gov.au/robots.txt. That website clearly details how our government's social security system works; it changes over time, which leaves the public at a disadvantage.

Mark Graham (post author), April 25, 2017 at 6:48 pm:
Yes, in general terms we think information produced by governments around the world, and published via public websites, should be preserved and made available via the Wayback Machine.

Andre Borie, April 21, 2017 at 10:20 am:
I really don't see any problem with this. If a human can access it, the archive should be able to as well; anyone who doesn't want their stuff searchable or archived online should just put a password on it. The only good thing about robots.txt is the rate limiting, so smaller sites can limit the bandwidth allocated to crawling if they wish. By the way, what does this mean for previously archived sites that have since changed their robots.txt to block the Archive? Do you still keep the original data, and if so, would you be able to restore access to it? I've seen a few sites that used to be accessible on the Wayback Machine but are not anymore due to a recent robots.txt change, and I'd love to see them available again if the original data wasn't deleted.

Jim Moores, April 21, 2017 at 11:40 am:
I've found that I can't access archive material that I myself created, because I let a domain I was no longer using expire and it now has a non-permissive robots.txt. At a minimum archive.org needs to respect the robots.txt only at the point of collection, but my personal opinion is that it should be ignored completely by archive.org, with people able to actively opt out in some other way.

Mo, April 21, 2017 at 2:35 pm:
In my case I was trying to retrieve an old web site of mine. A cybersquatter later bought the domain and put up a robots.txt. Now I can't see my own site. The archive respects a new robots.txt file owned by a squatter who is effectively blocking a historical archive they had NOTHING to do with. That is INSANE.

Mark Graham (post author), April 25, 2017 at 6:46 pm:
Thank you Mo. People write to us about the situation you describe every day. In many cases they implore us to make their content available again. This is exactly the harm we wish to address here. And, everyone, please remember you can always write to info@archive.org if you would like us to not crawl your site.

Ryan, April 21, 2017 at 10:06 pm:
Just this morning ia_archiver submitted a form on my site (the form was blank, but the point is that it clicked submit). Any crawler that submits forms is a jerk crawler. Would you consider redesigning your crawler to be less offensive?

Mark Graham (post author), April 25, 2017 at 6:42 pm:
The "ia_archiver" User Agent is used by Alexa Internet, not the Internet Archive.

Henrik, April 23, 2017 at 9:54 am:
On tools.ietf.org, all the information is public. I use robots.txt primarily to steer web crawlers away from pages which require substantial CPU resources to generate. Background: tools.ietf.org has been a pro-bono activity for 15 years, and runs on donated hardware; I don't have the means to upgrade to a level of CPU resources that could serve generated pages at the rate the searchbots can hit them. The pages I steer robots away from are, for instance, source repository diffs, logs, commits, etc., served through Trac. If a crawler is sufficiently gentle, and is able to back off its crawl rate when pages become slow to serve, I'm perfectly happy to have all of the pages now denied by my robots.txt crawled.
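[Editor's note: what Henrik asks for, like Andre Borie's point about rate limiting above, is an adaptive politeness policy: slow the crawl down when the server slows down. A minimal sketch of such a feedback loop, with illustrative thresholds; this is not any real crawler's implementation.]

    import time
    import urllib.request

    def polite_fetch(urls, min_delay=1.0, max_delay=60.0):
        # Fetch URLs one at a time, backing off when responses get slow
        # and easing back toward the minimum delay when they are fast.
        delay = min_delay
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
            elapsed = time.monotonic() - start
            if elapsed > 2.0:                      # server is struggling: double the pause
                delay = min(delay * 2, max_delay)
            else:                                  # server is healthy: crawl a bit faster
                delay = max(delay / 2, min_delay)
            yield url, body
            time.sleep(delay)

    # Usage (against your own test server):
    # for url, page in polite_fetch(["http://localhost:8000/a", "http://localhost:8000/b"]):
    #     print(url, len(page))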
Chris, April 23, 2017 at 6:38 pm:
I'd implore you to consider recognizing an "archive.txt"-like standard, then. For people like myself who maintain a personal website, I tend to use it as a file server and would be quite annoyed if my resume (which contains an email address and contact phone number) ended up archived. The alternative would be to remove everything I don't want archived. I don't think that's your intended goal, so please rethink this strategy.

Mark Graham (post author), April 25, 2017 at 6:39 pm:
Thank you for this Chris. Please do write to us at info@archive.org about any sites you manage. I promise we will be responsive.

Ross, April 23, 2017 at 8:13 pm:
Internet Archive, thank you for wanting to archive the web as users see it, which is the whole point of "Saving the Web"! I had respect for robots.txt 20 years ago, but today it's clear that we cannot allow site owners to affect the public record through their own selfish choices. Stay the course, thanks again!

vinz, April 24, 2017 at 2:20 am:
Alas, this comes too late for many of my favourite sites. 2014-2015 took out a lot for some reason, as did mid-2008. I guess I'll have to hold out until computers can reconstruct things straight from memory, then do a big ol' rip. I also wish I knew why it doesn't save images properly sometimes; I run into a lot of those at self-hosted sites, unless the crawler just happened to hit while a file was broken.

Darren Duncan, April 24, 2017 at 8:50 am:
This is a good move on the part of the Internet Archive in principle. At the very least, something I remember requesting of the Internet Archive years ago is that any respect they give robots.txt should be time sensitive. If a domain's robots.txt allows archiving in the present, then the Internet Archive should always make today's version of that content available in perpetuity, even if tomorrow's robots.txt for that domain denies archiving. I would want any website I operate to be archived, and if I gave up any of my domain names in the future, I would not want the future owners of those domain names to be able to cause the Internet Archive to stop displaying the versions of the domain that existed while I controlled it.

Vix, April 24, 2017 at 9:40 am:
"Archiving relying (..) more on representing the web as it really was, and is, from a user's perspective." I agree 100%. Robots.txt doesn't limit regular users, and archiving's purpose is to reflect the users' perspective, not SEO crawling. Go for it!

Michael Martinez, April 24, 2017 at 3:29 pm:
If you ignore "robots.txt" directives, people will find other ways to block you. While it's unfortunate that you don't keep data live after a "robots.txt" change, that is your own bad policy. The Robots Exclusion "standard" is NOT a standard; it's an arbitrary and voluntary set of guidelines. No one forced the archive to take content offline after domain names changed hands. You can easily correct that bad practice by changing your policy rather than blaming the non-standard "standard" (of which MOST PEOPLE are unaware) for the issue. While you're fixing the problems with your system, you could also make it easier for webmasters who do know about both the "robots.txt" file and your archive to correct errors, rather than having to wait 24 hours or longer for your crawler to see changes.

Mark Graham (post author), April 25, 2017 at 6:37 pm:
Thank you Michael. We encourage people to write to us at info@archive.org to report bugs and make requests (including requests for content to be removed from the Wayback Machine and for sites not to be crawled). I assure you we read every message sent to us, and act on them as appropriate. Many of the features we add, and bugs we fix, are a direct result of user feedback.

Pingback: The Internet Archive and robots.txt — Pixel Envy

nascent, April 25, 2017 at 1:04 pm:
There still needs to be a way of specifically preventing IA from archiving a domain.

Mark Graham (post author), April 25, 2017 at 6:33 pm:
Please know that site owners can always write to info@archive.org and request that content from a site be removed from the Wayback Machine and excluded from future crawling. We process requests like that every day.

Pingback: Editors' Choice: Robots.txt

Adam, April 25, 2017 at 6:25 pm:
It's bad enough that robots.txt not only prevents archiving, it also hides the entire archive (in other words, if you archive a site and it later deploys a blocking robots.txt, the archive disappears). That includes a website being hacked to include a robots.txt.

John, April 26, 2017 at 3:23 pm:
Then please explain how I can keep a site ephemeral, as intended. Are there IP address ranges, HTTP headers, etc., that can be used to forbid access? What is the way to reliably keep sites out of the archive, for the time you respected robots.txt, now, and forever? My sites explicitly tell robots "NOARCHIVE". You shouldn't even have the files on your systems. Retroactively making archives public is a dick move.

Mark Graham (post author), April 26, 2017 at 3:54 pm:
Hi John. Please email your request to info@archive.org and we will promptly process it.
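[Editor's note: the "NOARCHIVE" signal John mentions is normally expressed per page, either as a robots meta tag (<meta name="robots" content="noarchive">) or as an X-Robots-Tag HTTP response header. Search engines honor it; whether any given archival crawler does is exactly what this policy change leaves open, so emailing info@archive.org remains the reliable opt-out. For illustration only, a minimal Python standard-library server that attaches the header to every response.]

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class NoArchiveHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"<html><body>Ephemeral content.</body></html>\n"
            self.send_response(200)
            # Header-level equivalent of the noarchive meta tag; unlike the
            # meta tag, it also covers non-HTML files such as PDFs.
            self.send_header("X-Robots-Tag", "noarchive, noindex")
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Serves on http://localhost:8000/ until interrupted.
        HTTPServer(("localhost", 8000), NoArchiveHandler).serve_forever()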
Chris Haines, April 26, 2017 at 3:33 pm:
"We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user's perspective." I agree. I think times have changed, and this reflects what users want from a web archive these days.

Comments are closed.