Author Archive

website transition

April 29th, 2017 No comments

The website has been transitioned to Dreamhost. The downloads remain at George Mason University and can be reached at http://downloads.digitalcorpora.org/corpora/ for the corpora and http://downloads.digitalcorpora.org/downloads/ for files.

Categories: General Tags:

“non-deterministic” USB image contributed

May 27th, 2014 No comments

We are happy to announce the contribution of four disk images of a non-deterministic USB drive.

Categories: General Tags:

Announcing New File Type Sample Files

February 5th, 2014 No comments

UT San Antonio has kindly provided digitalcorpora with open source, publicly releasable samples of 32 file types. These are the samples that were used by Dr. Nicole Beebe to develop the Sceadan File Type Classifier.

Included file types are ASP, AVI, B64, B85, BZ2, CSS, DLL, ELF, EXE, EXT3, FAT, FLV, JAR, JB2, JS, M4A, MOV, MP3, MP4, NTFS, PST, RPM, RTF, Random, SWF, TXT, Tbird, URL, WAV, WMA, XLSX, ZIP. Each file type sample can be downloaded from the website:
* http://digitalcorpora.org/corp/nps/files/filetypes1/

Also included is a _README directory that contains a list of every file downloaded and a copyright statement for the files that are covered under copyright. You can access that directory at:
* http://digitalcorpora.org/corp/nps/files/filetypes1/_README/

This “FILETYPES1” corpus supplements the files in the GOVDOCS1 corpus.

Please let us know if you use these by including this citation in your paper:

“FILETYPES1 File type samples,” Beebe, Nicole, University of Texas, San Antonio, hosted at http://digitalcorpora.org/corp/nps/files/filetypes1/. 2014

Categories: Files, General Tags:

Malware Scan of Govdocs1 now available

August 15th, 2013 No comments

A malware scan of the govdocs1 corpus is now available at http://digitalcorpora.org/corp/nps/files/govdocs1/MetascanClientLog_201306281214.txt

Categories: General Tags:

35GB of JPEGs ready for download

March 7th, 2012 2 comments

We have created a tar file and a ZIP file, each containing 109,223 JPEG files from the govdocs1 corpus. You can download them from:

http://digitalcorpora.org/corp/nps/files/govdocs1/files.jpeg.tar   [37.6 GB]

http://digitalcorpora.org/corp/nps/files/govdocs1/files.jpeg.zip   [36.8 GB]

Please note that the ZIP file is necessarily a ZIP-64 file, because it exceeds the 4 GB limit of the original ZIP format, and it will not decompress with the ZIP implementation built into macOS or Windows.
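
One way to unpack a ZIP-64 archive without the built-in operating system tools is Python's standard zipfile module, which reads ZIP-64 archives transparently. The sketch below is only an illustration: the function name extract_members and its limit parameter are hypothetical and are not part of any digitalcorpora tooling.

```python
import zipfile
from pathlib import Path


def extract_members(archive_path, dest_dir, limit=None):
    """Extract up to `limit` files from a ZIP or ZIP-64 archive.

    Returns the list of member names that were extracted.
    `limit=None` extracts everything.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # zipfile handles ZIP-64 archives without any special flag when reading.
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            if limit is not None and len(extracted) >= limit:
                break
            archive.extract(info, dest)
            extracted.append(info.filename)
    return extracted
```

Calling extract_members("files.jpeg.zip", "jpegs", limit=100) would pull out the first 100 files, which can be a convenient way to sample the archive before committing to a full extraction.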

Categories: Files Tags:

M57-Jean Scenario Posted

February 8th, 2011 No comments

The scenario page for M57-Jean has now been posted.

Categories: Scenarios Tags:

test disk image of emails available

February 2nd, 2011 2 comments

I have created a new disk image called nps-2010-emails that can be used for testing programs that find email addresses or perform string searches.

The disk image consists of 30 different email addresses, each one stored in a different document with a different coding scheme.

Below is a list of the email addresses and their encodings:

email address                             Application (Encoding)

plain_text@textedit.com                   Apple TextEdit  (UTF-8)
plain_text_pdf@textedit.com               Apple TextEdit print-to-PDF (/FlateDecode)
rtf_text@textedit.com                     Apple TextEdit (RTF)
rtf_text_pdf@textedit.com                 Apple TextEdit print-to-PDF (/FlateDecode)
plain_utf16@textedit.com                  Apple TextEdit (UTF-16)
plain_utf16_pdf@textedit.com              Apple TextEdit print-to-PDF (/FlateDecode)

pages@iwork09.com                         Apple Pages '09
pages_comment@iwork09.com                 Apple Pages (comment) '09
keynote@iwork09.com                       Apple Keynote '09
keynote_comment@iwork09.com               Apple Keynote '09 (comment)
numbers@iwork09.com                       Apple Numbers '09
numbers_comment@iwork09.com               Apple Numbers '09 (comment)

user_doc@microsoftword.com                Microsoft Word 2008 (Mac) (.doc file)
user_doc_pdf@microsoftword.com            Microsoft Word 2008 (Mac) print-to-PDF
user_docx@microsoftword.com               Microsoft Word 2008 (Mac) (.docx file)
user_docx_pdf@microsoftword.com           Microsoft Word 2008 (Mac) print-to-PDF (.docx file)
xls_cell@microsoft_excel.com
xls_comment@microsoft_excel.com           Microsoft Word 2008 (Mac)
xlsx_cell@microsoft_excel.com             Microsoft Word 2008 (Mac)
xlsx_comment@microsoft_excel.com          Microsoft Word 2008 (Mac) (Comment)

doc_within_doc@document.com               Microsoft Word 2007 (OLE .doc file within .doc)
docx_within_docx@document.com             Microsoft Word 2007 (OLE .docx file within .docx)
ppt_within_doc@document.com               Microsoft PowerPoint and Word 2007 (OLE .ppt file within .doc)
pptx_within_docx@document.com             Microsoft PowerPoint and Word 2007 (OLE .pptx file within .docx)
xls_within_doc@document.com               Microsoft Excel and Word 2007 (OLE .xls file within .doc)
xlsx_within_docx@document.com             Microsoft Excel and Word 2007 (OLE .xlsx file within .docx)

email_in_zip@zipfile1.com                 text file within ZIP
email_in_zip_zip@zipfile2.com             ZIP'ed text file, ZIP'ed
email_in_gzip@gzipfile.com                text file within GZIP
email_in_gzip_gzip@gzipfile.com           GZIP'ed text file, GZIP'ed

The image can be downloaded from http://digitalcorpora.org/corp/nps/drives/nps-2010-emails/
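
To illustrate why the encoding matters for a string search, here is a minimal Python sketch that scans a byte buffer for addresses stored as plain ASCII/UTF-8 and as UTF-16LE, and that can also scan the contents of a zlib stream (the compression behind /FlateDecode). The regular expressions are deliberately simplified, and the function names are hypothetical; this is not how any particular forensic tool works.

```python
import re
import zlib

# Simplified address pattern for illustration only (not a full RFC 5322 parser).
EMAIL_ASCII = re.compile(rb"[A-Za-z0-9._-]+@[A-Za-z0-9._-]+\.[A-Za-z]{2,}")

# The same pattern with a NUL byte after every character, as in UTF-16LE text.
EMAIL_UTF16LE = re.compile(
    rb"(?:[A-Za-z0-9._-]\x00)+@\x00"
    rb"(?:[A-Za-z0-9._-]\x00)+\.\x00(?:[A-Za-z]\x00){2,}"
)


def find_emails(data):
    """Return the set of addresses found in `data` as ASCII or UTF-16LE."""
    found = {m.group().decode("ascii") for m in EMAIL_ASCII.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in EMAIL_UTF16LE.finditer(data)}
    return found


def find_emails_in_zlib_stream(stream):
    """Decompress one complete zlib stream, then scan the plaintext."""
    return find_emails(zlib.decompress(stream))
```

A scanner that only runs the ASCII pattern over raw sectors would be expected to miss the UTF-16 and compressed entries in the table above, which is the kind of gap this image is meant to expose.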

Edit, 2011-11-26 19:32 PST: One email was incorrectly recorded above. xlsx_comment@microsoft_excel.com is within the disk image, but xlsx_cell_comment@microsoft_excel.com was recorded here. That is now corrected above.

Categories: Disk Images Tags:

First 512 and 4096 byte block hashes of govdocs1

January 4th, 2011 No comments

I have posted a text file containing MD5 hashes for the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus. This file is intended for research on sector hashing. You can download the file from http://digitalcorpora.org/corp/nps/files/govdocs1/govdocs1-first512-first4096-docid.txt
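
For readers who want to compute comparable hashes for their own files, the following Python sketch hashes the first 512 and first 4096 bytes of a file with MD5. The function name is hypothetical, and the handling of files shorter than a block (hashing whatever bytes exist) is an assumption; the posted file may treat short files differently.

```python
import hashlib


def first_block_hashes(path, sizes=(512, 4096)):
    """Return {size: md5_hex} for the first `size` bytes of the file.

    Assumption: a file shorter than `size` is hashed over the bytes it has.
    """
    with open(path, "rb") as handle:
        head = handle.read(max(sizes))  # read once, slice for each size
    return {size: hashlib.md5(head[:size]).hexdigest() for size in sizes}
```

Reading only the largest prefix once keeps the I/O cost low when hashing a large collection of files.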

Categories: Files Tags:

Bots downloading disk images

December 27th, 2010 1 comment

I’m preparing some statistics on who (and what) is downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.

Here are the bots that we’ve found, and the number of image downloads attributed to each one.

  Rank     Count     Value(s):
  ============================
      1      2334      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      2       851      MLBot (www.metadatalabs.com/mlbot)
      3       811      SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
      4       749      Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
      5       492      Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
      6       130      Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
      7       115      Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/)
      8       109      msnbot/2.0b (+http://search.msn.com/msnbot.htm)
      9       108      Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
     10        89      CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
     11        87      Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
     12        78      TwengaBot-Discover (http://www.twenga.fr/bot-discover.html)
     13        58      Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
     14        51      msnbot/1.1 (+http://search.msn.com/msnbot.htm)
     15        26      Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
     16        21      Cityreview Robot (+http://www.cityreview.org/crawler/)
     17        18      'citeseerxbot'
     18        15      SindiceBot (heritrix/2.0.2 +http://sindice.com/developers/bot)
     19        12      Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
     20        11      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
     21         9      Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
     22         7      CatchBot/3.0; +http://www.catchbot.com
                7      CyberPatrol SiteCat Webbot (http://www.cyberpatrol.com/cyberpatrolcrawler.asp)
                7      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
     25         6      Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php)
                6      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
     27         5      MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
                5      yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) http://yacy.net/bot.html
                5      yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) http://yacy.net/bot.html
     30         3      msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
     31         2      Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9
                2      yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
                2      yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) http://yacy.net/bot.html
                2      yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) http://yacy.net/bot.html
     35         1      Mozilla/5.0 (compatible; Googlebot/2.1;  http://www.google.com/bot.html)
                1      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
                1      findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
                1      librabot/1.0 (+http://search.msn.com/msnbot.htm)
                1      yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) http://yacy.net/bot.html
                1      yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) http://yacy.net/bot.html 

Total items printed: 6242
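
A simple way to separate bot traffic from other downloads is to match marker substrings in the User-Agent string and tally each group. The Python sketch below is a heuristic for illustration only: the marker list and function names are assumptions, and it is not necessarily how the table above was produced.

```python
from collections import Counter

# Hypothetical marker substrings; real log analysis may need a longer list.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent):
    """Heuristic: treat a user agent as a bot if it contains a marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


def split_downloads(user_agents):
    """Return (bot_counts, other_counts), each a Counter keyed by user agent."""
    bots, others = Counter(), Counter()
    for agent in user_agents:
        (bots if is_bot(agent) else others)[agent] += 1
    return bots, others
```

Calling bots.most_common() on the first Counter gives a ranked list similar in shape to the table above, with one user agent and its download count per row.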

Categories: Stats Tags:

M57-Patents Scenario is Available

December 10th, 2010 No comments

The M57-Patents scenario is now available. This scenario includes nearly a terabyte of information with 50 disk images, memory dumps, and network packets. There are three specific crimes in the scenario that can be solved, but there are also collections of data that can be used to enable a variety of computer forensics research projects and tool development.

The scenario is split up into many pieces so you can download just what you need.

You can download it from

Categories: Disk Images, Scenarios Tags:

"This material is based upon work supported by the National Science Foundation under Grant No. 0919593. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."