First 512 and 4096 byte block hashes of govdocs1

January 4th, 2011

I have posted a text file containing MD5 hashes for the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus. This file is intended for research on sector hashing. You can download the file from http://digitalcorpora.org/corp/nps/files/govdocs1/govdocs1-first512-first4096-docid.txt
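For readers who want to compute comparable hashes on their own files, here is a minimal Python sketch. It assumes a file shorter than the block size is hashed as-is, without padding; how the posted file treats short files is not stated here, and the function name `first_block_md5s` is hypothetical.

```python
import hashlib


def first_block_md5s(path, sizes=(512, 4096)):
    """Return {size: hex MD5 of the first `size` bytes of the file at `path`}.

    Assumption: files shorter than `size` are hashed without padding.
    """
    with open(path, "rb") as f:
        head = f.read(max(sizes))  # read once, then slice for each size
    return {n: hashlib.md5(head[:n]).hexdigest() for n in sizes}
```

Calling `first_block_md5s("some_file.pdf")` returns a dictionary with one hex digest for the 512-byte prefix and one for the 4096-byte prefix.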

Categories: Files Tags:

Bots downloading disk images

December 27th, 2010

I’m preparing some statistics on who (and what) is downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.

Here are the bots that we’ve found, and the number of image downloads made by each one.

  Rank     Count     Value(s):
  ============================
      1      2334      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      2       851      MLBot (www.metadatalabs.com/mlbot)
      3       811      SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
      4       749      Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
      5       492      Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
      6       130      Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
      7       115      Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/)
      8       109      msnbot/2.0b (+http://search.msn.com/msnbot.htm)
      9       108      Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
     10        89      CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
     11        87      Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
     12        78      TwengaBot-Discover (http://www.twenga.fr/bot-discover.html)
     13        58      Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
     14        51      msnbot/1.1 (+http://search.msn.com/msnbot.htm)
     15        26      Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
     16        21      Cityreview Robot (+http://www.cityreview.org/crawler/)
     17        18      'citeseerxbot'
     18        15      SindiceBot (heritrix/2.0.2 +http://sindice.com/developers/bot)
     19        12      Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
     20        11      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
     21         9      Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
     22         7      CatchBot/3.0; +http://www.catchbot.com
                7      CyberPatrol SiteCat Webbot (http://www.cyberpatrol.com/cyberpatrolcrawler.asp)
                7      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
     25         6      Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php)
                6      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
     27         5      MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
                5      yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) http://yacy.net/bot.html
                5      yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) http://yacy.net/bot.html
     30         3      msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
     31         2      Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9
                2      yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
                2      yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) http://yacy.net/bot.html
                2      yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) http://yacy.net/bot.html
     35         1      Mozilla/5.0 (compatible; Googlebot/2.1;  http://www.google.com/bot.html)
                1      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
                1      findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
                1      librabot/1.0 (+http://search.msn.com/msnbot.htm)
                1      yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) http://yacy.net/bot.html
                1      yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) http://yacy.net/bot.html 

Total items printed: 6242
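
A tally like the one above can be approximated from a list of User-Agent strings. The sketch below is not the method used to build the table: the substring heuristic in `BOT_PATTERN` and the function names are assumptions for illustration only.

```python
import re
from collections import Counter

# Assumption: a user agent counts as a bot if it contains one of these
# substrings (case-insensitive). This is an illustrative heuristic.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def is_bot(user_agent):
    """Return True if the user-agent string looks like a crawler."""
    return bool(BOT_PATTERN.search(user_agent))


def count_bots(user_agents):
    """Count downloads per bot user-agent string, most frequent first."""
    counts = Counter(ua for ua in user_agents if is_bot(ua))
    return counts.most_common()
```

Passing one User-Agent string per logged download to `count_bots` yields `(user_agent, count)` pairs sorted by count, which is the shape of the table above.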
Categories: Stats Tags:

M57-Patents Scenario is Available

December 10th, 2010

The M57-Patents scenario is now available. This scenario includes nearly a terabyte of information with 50 disk images, memory dumps, and network packets. There are three specific crimes in the scenario that can be solved, but there are also collections of data that can be used to enable a variety of computer forensics research projects and tool development.

The scenario is split up into many pieces so you can download just what you need.

You can download it from

Categories: Disk Images, Scenarios Tags:

Nitroba University Scenario Available

December 9th, 2010

The Nitroba University Harassment Scenario is now available.

Categories: Scenarios Tags:

Announcing GOVDOCS1.1

December 4th, 2010

As an artifact of the way that it was collected, many of the extensions for the files in the NPS GOVDOCS1 corpus did not reflect the type of the underlying file. For example, many files that were labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases, file extensions chosen when the document was created no longer match current usage, as was the case with several files that had a ‘.doc’ extension but were actually WordPerfect files.

We have gone through the corpus and created a shell script that renames the files to current usage. The script contains 115,135 lines. Of these, the following renames are implemented:

  Rank     Count     Value(s):
  ============================
      1     77227      .text -> .txt  
      2      9290      .xml -> .html  
      3      3683      .pdf -> .html  
      4      3565      . -> .html  
      5      2602      . -> .unk  
      6      2601      .xls -> .dbase3  
      7      2082      .text -> .unk  
      8      1943      . -> .pdf  
      9      1942      .text -> .html  
     10      1857      .doc -> .html  
     11      1088      .doc -> .rtf  
     12       620      .xls -> .html  
     13       595      .text -> .f  
     14       533      .text -> .xml  
     15       459      .ppt -> .html  
     16       438      .xls -> .txt  
     17       435      .doc -> .txt  
     18       346      .doc -> .wp  
     19       283      .txt -> .html  
     20       269      .eps -> .html  
     21       256      .log -> .html  
     22       253      .doc -> .unk  
     23       228      .swf -> .html  
     24       218      .xls -> .unk  
     25       179      .text -> .fits  
     26       175      .dwf -> .html  
     27       166      .gz -> .html  
     28       163      .sql -> .html  
     29       161      .text -> .tex  
     30       155      .html -> .xml  
     31       107      .html -> .pdf  
     32        96      .text -> .troff  
     33        94      .ps -> .html  
     34        70      .js -> .html  
     35        66      . -> .xml  
     36        60      .xls -> .gls  
     37        59      .ttf -> .txt  
     38        53      .text -> .sgml  
     39        45      .jpg -> .html  
     40        36      .ppt -> .txt  
     41        35      .csv -> .html  
               35      .ttf -> .html  
     43        30      .ppt -> .unk  
     44        29      .text -> .pdf  
               29      .xbm -> .txt  
     46        26      .java -> .html  
               26      .zip -> .html  
     48        25      .doc -> .fm  
     49        22      .text -> .rtf  
     50        21      .pub -> .html  
     51        20      .js -> .txt  
     52        17      .jar -> .html  
               17      .jar -> .txt  
               17      .text -> .gz  
     55        16      .ps -> .pdf  
     56        15      .ppt -> .doc  
     57        14      .text -> .swf  
               14      .tmp -> .html  
               14      .xbm -> .html  
     60        13      .doc -> .pdf  
               13      .doc -> .troff  
     62         9      .pps -> .html  
                9      .xlsx -> .html  
     64         8      .log -> .txt  
     65         7      . -> .rtf  
                7      .dll -> .html  
                7      .kml -> .html  
                7      .xls -> .wk1  (Lotus 1-2-3)  
     69         6      .doc -> .f  
                6      .kmz -> .html  
                6      .xml -> .txt  
     72         5      . -> .txt  
                5      .doc -> .sgml  
                5      .docx -> .html  
                5      .eps -> .pdf  
                5      .exe -> .html  
                5      .html -> .rtf  
     78         4      .doc -> .ileaf  (Interleaf)  
                4      .ppt -> .zip  
                4      .pptx -> .html  
                4      .text -> .doc  
                4      .text -> .kml  
                4      .xls -> .zip  
     84         3      .bmp -> .html  
                3      .jpeg -> .html  
                3      .ppt -> .sgml  
                3      .text -> .wp  
                3      .tif -> .html  
                3      .xls -> .doc  
                3      .xls -> .xml  
     91         2      .exported -> .html  
                2      .ppt -> .appledouble  (AppleDouble encoded Macintosh file)
                2      .ppt -> .odp
                2      .ppt -> .gd
                2      .tmp -> .xml  
                2      .xls -> .123
                2      .xls -> .lnk  (MS Windows shortcut)
                2      .xls -> .pdf  
     99         1      .csv -> .rtf  
                1      .doc -> .par 
                1      .doc -> .zip
                1      .doc -> .fits  
                1      .doc -> .gz  
                1      .doc -> .icns  
                1      .doc -> .tex  
                1      .doc -> .xls  
                1      .doc -> .xml  
                1      .docx -> .pdf  
                1      .hlp -> .html  
                1      .hmtl -> .html  
                1      .html -> .gif  
                1      .html -> .kml  
                1      .kml -> .xml  
                1      .pdf -> .xml  
                1      .ppt -> .pdf  
                1      .sql -> .txt  
                1      .sys -> .rtf  
                1      .wp -> .pdf  
                1      .wp -> .rtf  
                1      .xls -> .wk3
                1      .xls -> .bin  (mc68020 pure executable)
                1      .xls -> .f  
                1      .xls -> .sgml  
                1      .xml -> .kml  
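
A count of this kind can be derived from pairs of old and new file names. The following is a minimal sketch, assuming the pairs have already been extracted; the rename script's line format is not shown here, so no parsing is attempted, and the helper names are hypothetical.

```python
import os
from collections import Counter


def ext(name):
    """Return the extension including its dot, or '.' when there is none."""
    return os.path.splitext(name)[1] or "."


def tally_renames(pairs):
    """Count renames by extension change.

    `pairs` is an iterable of (old_name, new_name) tuples; the result is a
    Counter keyed by strings such as '.text -> .txt'.
    """
    return Counter(f"{ext(old)} -> {ext(new)}" for old, new in pairs)
```

Sorting the result with `tally_renames(pairs).most_common()` gives the ranked order used in the table.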

We will be remaking the ZIP files over the next few days, and will replace them and update the searchable database by 7 December 2010.

Categories: Files Tags:

Forensic Innovations, Inc., analyzes the million file corpus

June 18th, 2010

Forensic Innovations, Inc., makers of File Investigator TOOLS, has performed an analysis of the 986,278 files in the “1 million file corpus”. (13,722 files in the corpus were removed earlier this year because they came from California State Government web servers in the .gov domain and were mistakenly collected as part of the original collection effort.)

We would like to thank Forensic Innovations for their work in support of this project. We have made available their summary report and will be making available their file-by-file analysis as soon as we deploy an appropriate database on this website.

Categories: General Tags:

Open Source Forensics Conference

March 21st, 2010

We will be making a presentation and handing out DVDs filled with data at the Open Source Forensics Conference, held in conjunction with the Basis Technology Government User’s Conference, June 8-9, 2010, at the Westfield Marriott in Chantilly, VA.

Categories: General Tags:

ISO 9660 disk images from anti-forensics.ru posted

March 8th, 2010

Our friends at anti-forensics.ru have given us seven very small disk images that are designed to demonstrate failings of particular open source Linux distributions.

You can view all of the images at http://digitalcorpora.org/corp/images/aor/. The images you will find there include:

These images should be directly copied to a hard drive or a partition. Forensic Linux distributions would use them as root file systems and execute proof-of-concept code during the boot.

Details of why these images are useful can be found on the author’s website, at: http://www.computer-forensics-lab.org/pdf/Linux_for_computer_forensic_investigators_2.pdf

Categories: General Tags:

MySQL tables for NIST NSRL RDS 2.26 posted

March 8th, 2010

Ever want to have SQL access to the NIST RDS but didn’t want to spend a month building the MySQL tables? Well, we did too… So we took one of our 8-core, 32GB servers, imported all of the NSRL, and then made a tar file of the tables available for download on this server.

To use these files, just download http://digitalcorpora.org/corp/nist/rds226.tar.bz2 and put the files in your MySQL data directory. You’ll be up and running in no time.
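
The unpacking step might look like the Python sketch below. It assumes the archive has already been downloaded and that you know where your MySQL data directory is (the location varies by installation). The function name `unpack_tables` is hypothetical, and the layout of files inside the archive is not described here, so the sketch just reports the member names it finds.

```python
import tarfile


def unpack_tables(archive_path, mysql_data_dir):
    """Extract a bzip2-compressed tar archive into `mysql_data_dir`.

    Returns the list of member names so you can see what was unpacked.
    Assumption: `mysql_data_dir` is the data directory of your own server.
    """
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = tar.getnames()
        tar.extractall(mysql_data_dir)
    return names
```

For example, `unpack_tables("rds226.tar.bz2", "/var/lib/mysql")` would extract into a common Linux default location; substitute the path your server actually uses.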

Categories: NIST Tags:

New Website

March 7th, 2010

We are revising the structure and content of this website. Please let us know if you find any problems.

Categories: General Tags:
"This material is based upon work supported by the National Science Foundation under Grant No. 0919593. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."