I’m preparing some statistics on who (and what) is downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is filter out the bots that are, for whatever reason, downloading the images.
Here are the bots that we’ve found, and the number of downloads attributed to each one.
Rank Count Value(s):
============================
   1  2334 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
   2   851 MLBot (www.metadatalabs.com/mlbot)
   3   811 SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/ (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
   4   749 Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
   5   492 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
   6   130 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
   7   115 Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/)
   8   109 msnbot/2.0b (+http://search.msn.com/msnbot.htm)
   9   108 Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
  10    89 CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
  11    87 Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
  12    78 TwengaBot-Discover (http://www.twenga.fr/bot-discover.html)
  13    58 Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
  14    51 msnbot/1.1 (+http://search.msn.com/msnbot.htm)
  15    26 Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
  16    21 Cityreview Robot (+http://www.cityreview.org/crawler/)
  17    18 'citeseerxbot'
  18    15 SindiceBot (heritrix/2.0.2 +http://sindice.com/developers/bot)
  19    12 Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
  20    11 Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
  21     9 Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
  22     7 CatchBot/3.0; +http://www.catchbot.com
         7 CyberPatrol SiteCat Webbot (http://www.cyberpatrol.com/cyberpatrolcrawler.asp)
         7 yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
  25     6 Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php
         6 yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
  27     5 MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
         5 yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) http://yacy.net/bot.html
         5 yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) http://yacy.net/bot.html
  30     3 msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
  31     2 Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/
         2 yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
         2 yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) http://yacy.net/bot.html
         2 yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) http://yacy.net/bot.html
  35     1 Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html
         1 Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
         1 findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
         1 librabot/1.0 (+http://search.msn.com/msnbot.htm)
         1 yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) http://yacy.net/bot.html
         1 yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) http://yacy.net/bot.html
         1 yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) http://yacy.net/bot.html
         1 yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
         1 yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) http://yacy.net/bot.html
Total items printed: 6242
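For anyone who wants to produce a similar tally from their own logs, here is a minimal sketch in Python. It is not the script used for the numbers above. It assumes the server writes the common "combined" access log format, where the user agent is the last double-quoted field, and it uses a hypothetical, heuristic list of markers (`BOT_MARKERS`) to decide what counts as a bot. The sample log lines, addresses, and paths are invented for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" access log format, in which the user agent
# is the last double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

# Assumption: a heuristic marker list, not the filter used for the table above.
BOT_MARKERS = re.compile(r"bot|crawl|spider", re.IGNORECASE)


def user_agent(line):
    """Return the user-agent field of one log line, or None if it is missing."""
    match = UA_FIELD.search(line)
    return match.group(1) if match else None


def tally_bots(lines):
    """Count requests per user agent, keeping only agents that look like bots."""
    counts = Counter()
    for line in lines:
        agent = user_agent(line)
        if agent and BOT_MARKERS.search(agent):
            counts[agent] += 1
    return counts


# Invented sample lines (documentation IP range, hypothetical paths).
sample = [
    '192.0.2.1 - - [01/Jan/2011:00:00:01 +0000] "GET /corpora/example.raw HTTP/1.1" 200 10 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.2 - - [01/Jan/2011:00:00:02 +0000] "GET /corpora/example.raw HTTP/1.1" 200 10 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0"',
    '192.0.2.1 - - [01/Jan/2011:00:00:03 +0000] "GET /corpora/other.raw HTTP/1.1" 200 10 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
]

for rank, (agent, count) in enumerate(tally_bots(sample).most_common(), start=1):
    print(rank, count, agent)
```

A substring match like this is crude: it will miss bots that send a browser-like user agent and may catch a legitimate client whose name happens to contain "bot", so the list deserves a manual review.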
Use robots.txt.
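As a rough illustration of the robots.txt suggestion, the sketch below uses Python's standard `urllib.robotparser` to show how a compliant crawler would interpret a simple rule set. The `/corpora/` path and the example URLs are assumptions made for the example, not the site's actual layout. robots.txt is advisory, so this only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the disk images live under a hypothetical /corpora/ path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /corpora/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if a crawler honoring ROBOTS_TXT may fetch the URL."""
    return parser.can_fetch(agent, url)


# Hypothetical URLs, checked the way a well-behaved crawler would check them.
print(allowed("Googlebot", "https://digitalcorpora.org/corpora/example.raw"))
print(allowed("Googlebot", "https://digitalcorpora.org/index.html"))
```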
Use an index.php that acts as a gateway and filters out bots with a script before the download starts.
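The gateway idea could look something like the sketch below. The comment suggests PHP; this version swaps in a small Python WSGI application to show the same logic, so treat it as an illustration rather than a drop-in `index.php`. The marker list, the response texts, and the `call` helper are all hypothetical, and a real gateway would stream the requested image instead of returning a placeholder.

```python
from wsgiref.util import setup_testing_defaults

# Assumption: a heuristic marker list; adjust it after reviewing real logs.
BOT_MARKERS = ("bot", "crawl", "spider")


def gateway(environ, start_response):
    """Refuse requests whose user agent looks like a bot, serve the rest."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(marker in agent for marker in BOT_MARKERS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated downloads are not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Placeholder: a real gateway would stream the requested image here.
    return [b"download would start here\n"]


def call(agent):
    """Hypothetical helper: invoke the gateway with a given user agent."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = agent
    seen = []
    body = gateway(environ, lambda status, headers: seen.append(status))
    return seen[0], b"".join(body)


print(call("msnbot/2.0b (+http://search.msn.com/msnbot.htm)")[0])
print(call("Mozilla/5.0 (X11; Linux x86_64; rv:2.0) Gecko/20100101 Firefox/4.0")[0])
```

Unlike robots.txt, a gateway enforces the rule on the server side, but it still trusts the user-agent header, which any client can set to whatever it likes.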