id author title date pages extension mime words sentences flesch summary cache txt
blog-archive-org-5806 Robots.txt meant for search engines don’t work well for web archives - Internet Archive Blogs .html text/html 3437 266 79 We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). Internet Archive, thank you for wanting to archive the web as users see it, which is the whole point of "Saving the Web!" I had respect for robots.txt 20 years ago, but it's today clear that we cannot allow site owners to affect the public record by their own selfish choices. "We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user's perspective." ./cache/blog-archive-org-5806.html ./txt/blog-archive-org-5806.txt