Bots downloading disk images

I’m preparing some statistics on who (and what) are downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.

Here’s the bots that we’ve found, and the number of times each image has been downloaded by a bot.

  Rank     Count     Value(s):
  ============================
      1      2334      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      2       851      MLBot (www.metadatalabs.com/mlbot)
      3       811      SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
      4       749      Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
      5       492      Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
      6       130      Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
      7       115      Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/)
      8       109      msnbot/2.0b (+http://search.msn.com/msnbot.htm)
      9       108      Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
     10        89      CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
     11        87      Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
     12        78      TwengaBot-Discover (http://www.twenga.fr/bot-discover.html)
     13        58      Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
     14        51      msnbot/1.1 (+http://search.msn.com/msnbot.htm)
     15        26      Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
     16        21      Cityreview Robot (+http://www.cityreview.org/crawler/)
     17        18      'citeseerxbot'
     18        15      SindiceBot (heritrix/2.0.2 +http://sindice.com/developers/bot)
     19        12      Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
     20        11      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
     21         9      Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
     22         7      CatchBot/3.0; +http://www.catchbot.com
                7      CyberPatrol SiteCat Webbot (http://www.cyberpatrol.com/cyberpatrolcrawler.asp)
                7      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
     25         6      Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php
                6      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
     27         5      MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
                5      yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) http://yacy.net/bot.html
                5      yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) http://yacy.net/bot.html
     30         3      msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
     31         2      Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9
                2      yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
                2      yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) http://yacy.net/bot.html
                2      yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) http://yacy.net/bot.html
     35         1      Mozilla/5.0 (compatible; Googlebot/2.1;  http://www.google.com/bot.html
                1      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
                1      findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
                1      librabot/1.0 (+http://search.msn.com/msnbot.htm)
                1      yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) http://yacy.net/bot.html
                1      yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) http://yacy.net/bot.html 

Total items printed: 6242

One Comment

  1. use robots.txt
    use a index.php that act as a gateway and filter out bots with a script beforehand.

Leave a Reply to alien Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

This site uses Akismet to reduce spam. Learn how your comment data is processed.