Announcing New File Type Sample Files

February 5th, 2014 No comments

UT San Antonio has kindly provided digitalcorpora with open source, publicly releasable samples of 32 file types. These are the samples that were used by Dr. Nicole Beebe to develop the Sceadan File Type Classifier.

Included file types are ASP, AVI, B64, B85, BZ2, CSS, DLL, ELF, EXE, EXT3, FAT, FLV, JAR, JB2, JS, M4A, MOV, MP3, MP4, NTFS, PST, RPM, RTF, Random, SWF, TXT, Tbird, URL, WAV, WMA, XLSX, ZIP. Each file type sample can be downloaded from the website:

Also included is a _README directory that includes a list of every file downloaded and a copyright statement for the files that are covered under copyright. You can access that directory at:

This “FILETYPES1” corpus supplements the files in the GOVDOCS1 corpus.

Please let us know if you use these by including this citation in your paper:

“FILETYPES1 File type samples,” Beebe, Nicole, University of Texas, San Antonio, hosted at 2014

Categories: Files, General Tags:

Announcement: hashdb toolset

October 24th, 2013 No comments

The text file govdocs1-first512-first4096-docid.txt containing MD5 hashes of the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus has been removed.  This file was provided to assist with research of block hashes.  We have since created the hashdb toolset which provides support for creating and working with hash block databases.  Please refer to for downloading the code, continuing progress on this topic, and links to relevant papers including:

Distinct Sector Hashes for Target File Detection

A related master's thesis on this topic was completed at the Naval Postgraduate School in 2012 and can be downloaded for additional reading:




Categories: General Tags:

Malware Scan of Govdocs1 now available

August 15th, 2013 No comments

A malware scan of the govdocs1 corpus is now available at


Categories: General Tags:

Bulk Extractor News and Downloads

April 3rd, 2013 No comments

File contains the source code for bulk_extractor v1.3.1.  bulk_extractor is a C++ program that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures.  bulk_extractor is typically downloaded on a Fedora system and compiled or cross-compiled to Linux, Mac, or Windows using autotools.  Please see

BEViewer.jar is an executable bulk_extractor viewer user interface.
Bulk Extractor Viewer (BEViewer) provides a graphical user interface for browsing features that have been extracted via the bulk extractor feature extraction tool.  Please see

be_installer-1.3.exe is a Windows installer for installing bulk_extractor and BEViewer v1.3 on a Windows system.

bulk_extractor.pdf, “Digital media triage with bulk data analysis and bulk-extractor,” discusses how the bulk_extractor tool is effective in providing bulk data analysis.

2012-08-08 bulk_extractor Tutorial.pdf describes how to use the BEViewer tool.  Although some of the parameters for running bulk_extractor have changed, the majority of the tutorial remains current.

Source: The information above and links were received from Bruce Allen <>, Naval Postgraduate School

See other bulk_extractor downloads here:

Categories: General Tags:

Hash Codes

March 29th, 2013 No comments

The following post is now obsolete. The file frequent_hashcodes_and_paths_rdc.xml has been removed from the corpus as explained in a more recent post. Please see:

Deprecated Post from Mar 29, 2013 @ 13:13

The file frequent_hashcodes_and_paths_rdc.xml contains SHA1 hashcode and path data derived from the Real Drive Corpus collected by the DEEP Project at the U.S. Naval Postgraduate School. The file provides two kinds of data useful to forensic investigators: (1) SHA1 hashcodes that occurred for undeleted files on at least five different drives in the corpus but did not occur in the National Software Reference Library. These are likely to indicate files that are uninteresting and can be excluded in most forensic investigations. File sizes and names are also given. (2) Path names (file name plus all directories) for paths that occurred on undeleted files on at least twenty different drives in the corpus. These usefully supplement the hashcodes in indicating recurring files that are uninteresting for investigators. However, occurrences of these files could include viruses and other malware, or could be hiding illegal content, although this is unlikely.
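The selection rule for the first kind of data can be sketched roughly as follows. This is only an illustration of the rule as described above, not the code used to build the file; the function name, the input layout (a mapping from drive ID to that drive's SHA1 values), and the NSRL set are all assumptions made for the example.

```python
from collections import Counter


def frequent_unknown_hashes(drive_hashes, nsrl_hashes, min_drives=5):
    """Return {sha1: drive_count} for hashes seen on enough drives but not in NSRL.

    drive_hashes: {drive_id: iterable of SHA1 hex strings for undeleted files}
                  (hypothetical input layout for this sketch)
    nsrl_hashes:  set of SHA1 hex strings known to the NSRL
    """
    drive_counts = Counter()
    for hashes in drive_hashes.values():
        # set() makes each drive count once per hash, however many
        # copies of the file that drive holds.
        drive_counts.update(set(hashes))
    return {
        sha1: count
        for sha1, count in drive_counts.items()
        if count >= min_drives and sha1 not in nsrl_hashes
    }
```

The same shape would work for the path list by passing path names instead of hashes, an empty NSRL set, and min_drives=20.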

Read more … XML File (HAS BEEN REMOVED):   (102 MB)

Categories: General Tags:

35GB of JPEGs ready for download

March 7th, 2012 2 comments

We have created a tar and a ZIP file with 109,223 files from the govdocs1m corpus. You can download them from:   [37.6 GB]

Browse all by type:

Please note that the ZIP file is necessarily a ZIP-64 file and will not decompress with the ZIP implementation built into MacOS or Windows.
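One possible workaround is Python's standard zipfile module, which reads ZIP64 archives. The sketch below pulls out only members with a given suffix; the function name, the ".jpg" default, and the limit parameter are assumptions for illustration, and the archive path is whatever you saved the download as.

```python
import zipfile


def extract_members(archive_path, dest_dir, suffix=".jpg", limit=None):
    """Extract members whose names end with `suffix`; return their names."""
    extracted = []
    # zipfile handles ZIP64 archives transparently when reading.
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(suffix):
                continue
            zf.extract(info, dest_dir)
            extracted.append(info.filename)
            # Optional cap, handy for sampling a very large archive.
            if limit is not None and len(extracted) >= limit:
                break
    return extracted
```

Passing a small limit lets you check that the archive opens correctly before committing disk space to the full extraction.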

Categories: Files Tags:

M57-Jean Scenario Posted

February 8th, 2011 No comments

The scenario page for M57-Jean has now been posted.

Categories: Scenarios Tags:

test disk image of emails available

February 2nd, 2011 4 comments

I have created a new disk image called 2010-nps-emails that can be used for testing programs that find email addresses or perform string search.

The disk image consists of 30 different email addresses, each one stored in a different document with a different coding scheme.

Below is a list of the email addresses and their codings:

email address    Application (Encoding)
                 Apple TextEdit (UTF-8)
                 Apple TextEdit print-to-PDF (/FlateDecode)
                 Apple TextEdit (RTF)
                 Apple TextEdit print-to-PDF (/FlateDecode)
                 Apple TextEdit (UTF-16)
                 Apple TextEdit print-to-PDF (/FlateDecode)
                 Apple Pages '09
                 Apple Pages (comment) '09
                 Apple Keynote '09
                 Apple Keynote '09 (comment)
                 Apple Numbers '09
                 Apple Numbers '09 (comment)
                 Microsoft Word 2008 (Mac) (.doc file)
                 Microsoft Word 2008 (Mac) print-to-PDF
                 Microsoft Word 2008 (Mac) print-to-PDF (.docx file)
                 Microsoft Word 2008 (Mac)
                 Microsoft Word 2008 (Mac)
                 Microsoft Word 2008 (Mac) (Comment)
                 Microsoft Word 2007 (OLE .doc file within .doc)
                 Microsoft Word 2007 (OLE .doc file within .doc)
                 Microsoft PowerPoint and Word 2007 (OLE .ppt file within .doc)
                 Microsoft PowerPoint and Word 2007 (OLE .pptx file within .docx)
                 Microsoft Excel and Word 2007 (OLE .xls file within .doc)
                 Microsoft Excel and Word 2007 (OLE .xlsx file within .docx)
                 text file within ZIP
                 ZIP'ed text file, ZIP'ed
                 text file within GZIP
                 GZIP'ed text file, GZIP'ed

The image can be downloaded from
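To see why the coding matters for a string search, here is a minimal sketch that looks for one address in raw bytes under a few plain text encodings. The function name and the choice of encodings are assumptions for illustration, and this is not how any particular forensic tool works.

```python
def find_encoded(haystack: bytes, needle: str):
    """Return {encoding: [byte offsets]} for each plain encoding of `needle`."""
    encodings = ("utf-8", "utf-16-le", "utf-16-be")  # assumed set for this sketch
    hits = {}
    for enc in encodings:
        pattern = needle.encode(enc)
        offsets = []
        start = haystack.find(pattern)
        while start != -1:
            offsets.append(start)
            start = haystack.find(pattern, start + 1)
        if offsets:
            hits[enc] = offsets
    return hits
```

A search like this only sees addresses stored as uncompressed text. Addresses inside /FlateDecode PDF streams, ZIP members, or GZIP files have to be decompressed before the bytes can match, which is what makes those documents in the image useful test cases.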

Edit, 2011-11-26 19:32 PST: One email was incorrectly recorded above. is within the disk image, but was recorded here. That is now corrected above.

Categories: Disk Images Tags:

First 512 and 4096 byte block hashes of govdocs1

January 4th, 2011 No comments

I have posted a text file containing MD5 hashes for the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus. This file is intended for research on sector hashing. You can download the file from
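For anyone who wants to compute comparable hashes on their own files, a minimal sketch follows. The function names are hypothetical, and hashing short files as-is without padding is an assumption of this sketch rather than a documented property of the posted file.

```python
import hashlib
from pathlib import Path


def first_block_md5s(path, sizes=(512, 4096)):
    """Return {size: hex MD5 of the first `size` bytes of the file}."""
    with open(path, "rb") as f:
        head = f.read(max(sizes))  # one read covers every prefix length
    # Assumption: a file shorter than `size` is hashed as-is, unpadded.
    return {size: hashlib.md5(head[:size]).hexdigest() for size in sizes}


def hash_corpus(root):
    """Yield (path, md5_first_512, md5_first_4096) for every file under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digests = first_block_md5s(path)
            yield str(path), digests[512], digests[4096]
```

Iterating over hash_corpus("some/directory") gives one tuple per file, which can be written out in whatever text layout suits the experiment.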

Categories: Files Tags:

Bots downloading disk images

December 27th, 2010 1 comment

I’m preparing some statistics on who (and what) are downloading the disk images we have here at The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.

Here are the bots that we’ve found, and the number of image downloads made by each.

  Rank     Count     Value(s):
      1      2334      Mozilla/5.0 (compatible; Googlebot/2.1; +
      2       851      MLBot (
      3       811      SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/ (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +
      4       749      Mozilla/5.0 (compatible; DotBot/1.1;,
      5       492      Mozilla/5.0 (compatible; YandexBot/3.0; +
      6       130      Mozilla/5.0 (compatible; bingbot/2.0; +
      7       115      Mozilla/5.0 (compatible; DBLBot/1.0; +
      8       109      msnbot/2.0b (+
      9       108      Mozilla/5.0 (compatible; SiteBot/0.1; +
     10        89      CCBot/1.0 (+
     11        87      Mozilla/5.0 (Twiceler-0.9
     12        78      TwengaBot-Discover (
     13        58      Mozilla/5.0 (compatible; Purebot/1.1; +
     14        51      msnbot/1.1 (+
     15        26      Mozilla/5.0 (compatible; MJ12bot/v1.3.2;
     16        21      Cityreview Robot (+
     17        18      'citeseerxbot'
     18        15      SindiceBot (heritrix/2.0.2 +
     19        12      Mozilla/5.0 (compatible; MJ12bot/v1.3.1;
     20        11      Mozilla/5.0 (compatible; discobot/1.1; +
     21         9      Mozilla/5.0 (compatible; Exabot/3.0; +
     22         7      CatchBot/3.0; +
                7      CyberPatrol SiteCat Webbot (
                7      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en)
     25         6      Mozilla/5.0 (compatible; Search17Bot/1.1;
                6      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de)
     27         5      MSRBOT (
                5      yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en)
                5      yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en)
     30         3      msnbot-media/1.1 (+
     31         2      Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/
                2      yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en)
                2      yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en)
                2      yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en)
     35         1      Mozilla/5.0 (compatible; Googlebot/2.1;
                1      Mozilla/5.0 (compatible; discobot/1.1; +
                1 (Robot;
                1      librabot/1.0 (+
                1      yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en)
                1      yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de)
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de)
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de)
                1      yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) 

Total items printed: 6242
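
A tally like the one above can be sketched in a few lines. This is not the script that produced these numbers; the regular expression is an assumed heuristic for spotting bots, and the input is assumed to be a list of User-Agent strings already pulled from the access log, one per image download.

```python
import re
from collections import Counter

# Assumed heuristic: treat any agent mentioning these words as a bot.
BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def split_bots(user_agents):
    """Return (bot_counts, other_counts) as Counters keyed by User-Agent."""
    bots, others = Counter(), Counter()
    for ua in user_agents:
        (bots if BOT_PATTERN.search(ua) else others)[ua] += 1
    return bots, others


def ranked(counter):
    """Yield (rank, count, value); tied counts share the first rank of the tie."""
    rank, prev = 0, None
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    for position, (value, count) in enumerate(ordered, start=1):
        if count != prev:
            rank, prev = position, count
        yield rank, count, value
```

Summing the bot counter's values gives the total number of bot downloads, and the second counter is what remains for the real statistics.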
Categories: Stats Tags: