The scenario page for M57-Jean has now been posted.
Test disk image of emails available
I have created a new disk image called nps-2010-emails that can be used for testing programs that find email addresses or perform string search.
The disk image contains 30 different email addresses, each one stored in a different document with a different coding scheme.
Below is a list of the email addresses and their codings:
Email address                      Application (Encoding)
=====================================================================================
plain_text@textedit.com            Apple TextEdit (UTF-8)
plain_text_pdf@textedit.com        Apple TextEdit print-to-PDF (/FlateDecode)
rtf_text@textedit.com              Apple TextEdit (RTF)
rtf_text_pdf@textedit.com          Apple TextEdit print-to-PDF (/FlateDecode)
plain_utf16@textedit.com           Apple TextEdit (UTF-16)
plain_utf16_pdf@textedit.com       Apple TextEdit print-to-PDF (/FlateDecode)
pages@iwork09.com                  Apple Pages '09
pages_comment@iwork09.com          Apple Pages (comment) '09
keynote@iwork09.com                Apple Keynote '09
keynote_comment@iwork09.com        Apple Keynote '09 (comment)
numbers@iwork09.com                Apple Numbers '09
numbers_comment@iwork09.com        Apple Numbers '09 (comment)
user_doc@microsoftword.com         Microsoft Word 2008 (Mac) (.doc file)
user_doc_pdf@microsoftword.com     Microsoft Word 2008 (Mac) print-to-PDF
user_docx@microsoftword.com
user_docx_pdf@microsoftword.com    Microsoft Word 2008 (Mac) print-to-PDF (.docx file)
xls_cell@microsoft_excel.com
xls_comment@microsoft_excel.com    Microsoft Word 2008 (Mac)
xlsx_cell@microsoft_excel.com      Microsoft Word 2008 (Mac)
xlsx_comment@microsoft_excel.com   Microsoft Word 2008 (Mac) (Comment)
doc_within_doc@document.com        Microsoft Word 2007 (OLE .doc file within .doc)
docx_within_docx@document.com      Microsoft Word 2007 (OLE .doc file within .doc)
ppt_within_doc@document.com        Microsoft PowerPoint and Word 2007 (OLE .ppt file within .doc)
pptx_within_docx@document.com      Microsoft PowerPoint and Word 2007 (OLE .pptx file within .docx)
xls_within_doc@document.com        Microsoft Excel and Word 2007 (OLE .xls file within .doc)
xlsx_within_docx@document.com      Microsoft Excel and Word 2007 (OLE .xlsx file within .docx)
email_in_zip@zipfile1.com          text file within ZIP
email_in_zip_zip@zipfile2.com      ZIP'ed text file, ZIP'ed
email_in_gzip@gzipfile.com         text file within GZIP
email_in_gzip_gzip@gzipfile.com    GZIP'ed text file, GZIP'ed
The image can be downloaded from nps-2010-emails
Edit, 2011-11-26 19:32 PST: One email was incorrectly recorded above. xlsx_comment@microsoft_excel.com is within the disk image, but xlsx_cell_comment@microsoft_excel.com was recorded here. That is now corrected above.
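As a rough illustration of the kind of program this image is meant to exercise, here is a minimal Python sketch that searches raw bytes for email addresses stored as plain ASCII/UTF-8 or as UTF-16LE. The function name and the regular expressions are my own assumptions, not part of any existing tool. A scan like this should not be expected to find the addresses stored inside compressed containers (the /FlateDecode PDFs, the ZIP and GZIP files, and the .docx/.xlsx/.pptx documents), which is exactly what makes the image a useful test.

```python
import re

# Plain ASCII / UTF-8 address. Underscores are allowed in the domain
# because the test addresses include microsoft_excel.com.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)+")

# The same shape in UTF-16LE, where every ASCII character is followed
# by a NUL byte.
EMAIL16_RE = re.compile(
    rb"(?:[A-Za-z0-9._%+-]\x00)+@\x00"
    rb"(?:[A-Za-z0-9_-]\x00)+(?:\.\x00(?:[A-Za-z0-9_-]\x00)+)+"
)


def find_emails(data: bytes) -> set:
    """Return the set of email addresses visible in the raw bytes."""
    found = {m.group().decode("ascii") for m in EMAIL_RE.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in EMAIL16_RE.finditer(data)}
    return found


# Hypothetical usage on a raw (dd-style) copy of the image, assuming it is
# small enough to read into memory in one go:
#   with open("nps-2010-emails.raw", "rb") as f:
#       print(sorted(find_emails(f.read())))
```

Comparing the output against the 30 addresses listed above gives a quick measure of which encodings a tool can and cannot see.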
First 512 and 4096 byte block hashes of govdocs1
I have posted a text file containing MD5 hashes for the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus. This file is intended for research on sector hashing. You can download the file from https://downloads.digitalcorpora.org/corpora/files/govdocs1/govdocs1-first512-first4096-docid.txt
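If you want to compute comparable hashes for your own files, something like the following Python sketch should work. It is an assumption on my part that files shorter than 512 or 4096 bytes are hashed as-is, with no padding; check a few entries against the posted file before relying on it. The function name is hypothetical.

```python
import hashlib


def prefix_hashes(path):
    """Return (md5 of first 512 bytes, md5 of first 4096 bytes) as hex strings.

    Assumption: a file shorter than the prefix length is hashed as-is,
    without padding.
    """
    with open(path, "rb") as f:
        head = f.read(4096)
    first512 = hashlib.md5(head[:512]).hexdigest()
    first4096 = hashlib.md5(head).hexdigest()
    return first512, first4096


# Hypothetical usage:
#   h512, h4096 = prefix_hashes("000001.pdf")
```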
Bots downloading disk images
I’m preparing some statistics on who (and what) are downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.
Here are the bots that we’ve found, and the number of image downloads made by each.
Rank  Count  Value(s):
============================
   1   2334  Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
   2    851  MLBot (www.metadatalabs.com/mlbot)
   3    811  SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +https://www.google.com/bot.html)
   4    749  Mozilla/5.0 (compatible; DotBot/1.1; https://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
   5    492  Mozilla/5.0 (compatible; YandexBot/3.0; +https://yandex.com/bots)
   6    130  Mozilla/5.0 (compatible; bingbot/2.0; +https://www.bing.com/bingbot.htm)
   7    115  Mozilla/5.0 (compatible; DBLBot/1.0; +https://www.dontbuylists.com/)
   8    109  msnbot/2.0b (+https://search.msn.com/msnbot.htm)
   9    108  Mozilla/5.0 (compatible; SiteBot/0.1; +https://www.sitebot.org/robot/)
  10     89  CCBot/1.0 (+https://www.commoncrawl.org/bot.html)
  11     87  Mozilla/5.0 (Twiceler-0.9 https://www.cuil.com/twiceler/robot.html)
  12     78  TwengaBot-Discover (https://www.twenga.fr/bot-discover.html)
  13     58  Mozilla/5.0 (compatible; Purebot/1.1; +https://www.puritysearch.net/)
  14     51  msnbot/1.1 (+https://search.msn.com/msnbot.htm)
  15     26  Mozilla/5.0 (compatible; MJ12bot/v1.3.2; https://www.majestic12.co.uk/bot.php?+)
  16     21  Cityreview Robot (+https://www.cityreview.org/crawler/)
  17     18  'citeseerxbot'
  18     15  SindiceBot (heritrix/2.0.2 +https://sindice.com/developers/bot)
  19     12  Mozilla/5.0 (compatible; MJ12bot/v1.3.1; https://www.majestic12.co.uk/bot.php?+)
  20     11  Mozilla/5.0 (compatible; discobot/1.1; +https://discoveryengine.com/discobot.html
  21      9  Mozilla/5.0 (compatible; Exabot/3.0; +https://www.exabot.com/go/robot)
  22      7  CatchBot/3.0; +https://www.catchbot.com
          7  CyberPatrol SiteCat Webbot (https://www.cyberpatrol.com/cyberpatrolcrawler.asp)
          7  yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) https://yacy.net/bot.html
  25      6  Mozilla/5.0 (compatible; Search17Bot/1.1; https://www.search17.com/bot.php
          6  yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) https://yacy.net/bot.html
  27      5  MSRBOT (https://research.microsoft.com/research/sv/msrbot/)
          5  yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) https://yacy.net/bot.html
          5  yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) https://yacy.net/bot.html
  30      3  msnbot-media/1.1 (+https://search.msn.com/msnbot.htm)
  31      2  Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9
          2  yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) https://yacy.net/bot.html
          2  yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) https://yacy.net/bot.html
          2  yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) https://yacy.net/bot.html
  35      1  Mozilla/5.0 (compatible; Googlebot/2.1; https://www.google.com/bot.html
          1  Mozilla/5.0 (compatible; discobot/1.1; +https://discoveryengine.com/discobot.html)
          1  findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
          1  librabot/1.0 (+https://search.msn.com/msnbot.htm)
          1  yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) https://yacy.net/bot.html
          1  yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) https://yacy.net/bot.html
          1  yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) https://yacy.net/bot.html
          1  yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) https://yacy.net/bot.html
          1  yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) https://yacy.net/bot.html
Total items printed: 6242
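For anyone who wants to do similar filtering on their own server logs, here is a minimal Python sketch of one possible approach: flag a request as a bot when its User-Agent string contains a bot-like token, then tally the flagged strings. This is only a heuristic and not necessarily how the list above was produced; the token list and function names are assumptions. Every user agent in the table above does happen to contain "bot" in some letter case.

```python
import re
from collections import Counter

# Heuristic tokens (an assumption): most self-identifying crawlers
# include one of these somewhere in the User-Agent string.
BOT_RE = re.compile(r"bot|crawler|spider", re.IGNORECASE)


def is_bot(user_agent: str) -> bool:
    """Guess whether a User-Agent string belongs to a crawler."""
    return bool(BOT_RE.search(user_agent))


def count_bots(user_agents):
    """Tally downloads per bot User-Agent, ignoring everything else."""
    return Counter(ua for ua in user_agents if is_bot(ua))


# Hypothetical usage, given an iterable of User-Agent strings
# already extracted from the download log:
#   for ua, n in count_bots(agents).most_common():
#       print(n, ua)
```

A substring match like this will miss crawlers that present a browser User-Agent, so the counts are best treated as a lower bound.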
M57-Patents Scenario is Available
The M57-Patents scenario is now available. This scenario includes nearly a terabyte of information with 50 disk images, memory dumps, and network packets. There are three specific crimes in the scenario that can be solved, but there are also collections of data that can be used to enable a variety of computer forensics research projects and tool development.
The scenario is split up into many pieces so you can download just what you need.
You can download it from
Nitroba University Scenario Available
The Nitroba University Harassment Scenario is now available.
Announcing GOVDOCS1.1
As an artifact of the way that it was collected, many of the extensions for the files in the NPS GOVDOCS1 corpus did not reflect the type of the underlying file. For example, many files that were labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases, file extensions chosen when the document was created no longer match current usage, as was the case with several files that had a ‘.doc’ extension but were actually WordPerfect files.
We have gone through the corpus and created a shell script that renames the files to current usage. The script contains 115,135 lines. Of these, the following renames are implemented:
Rank  Count  Value(s):
============================
   1  77227  .text -> .txt
   2   9290  .xml -> .html
   3   3683  .pdf -> .html
   4   3565  . -> .html
   5   2602  . -> .unk
   6   2601  .xls -> .dbase3
   7   2082  .text -> .unk
   8   1943  . -> .pdf
   9   1942  .text -> .html
  10   1857  .doc -> .html
  11   1088  .doc -> .rtf
  12    620  .xls -> .html
  13    595  .text -> .f
  14    533  .text -> .xml
  15    459  .ppt -> .html
  16    438  .xls -> .txt
  17    435  .doc -> .txt
  18    346  .doc -> .wp
  19    283  .txt -> .html
  20    269  .eps -> .html
  21    256  .log -> .html
  22    253  .doc -> .unk
  23    228  .swf -> .html
  24    218  .xls -> .unk
  25    179  .text -> .fits
  26    175  .dwf -> .html
  27    166  .gz -> .html
  28    163  .sql -> .html
  29    161  .text -> .tex
  30    155  .html -> .xml
  31    107  .html -> .pdf
  32     96  .text -> .troff
  33     94  .ps -> .html
  34     70  .js -> .html
  35     66  . -> .xml
  36     60  .xls -> .gls
  37     59  .ttf -> .txt
  38     53  .text -> .sgml
  39     45  .jpg -> .html
  40     36  .ppt -> .txt
  41     35  .csv -> .html
         35  .ttf -> .html
  43     30  .ppt -> .unk
  44     29  .text -> .pdf
         29  .xbm -> .txt
  46     26  .java -> .html
         26  .zip -> .html
  48     25  .doc -> .fm
  49     22  .text -> .rtf
  50     21  .pub -> .html
  51     20  .js -> .txt
  52     17  .jar -> .html
         17  .jar -> .txt
         17  .text -> .gz
  55     16  .ps -> .pdf
  56     15  .ppt -> .doc
  57     14  .text -> .swf
         14  .tmp -> .html
         14  .xbm -> .html
  60     13  .doc -> .pdf
         13  .doc -> .troff
  62      9  .pps -> .html
          9  .xlsx -> .html
  64      8  .log -> .txt
  65      7  . -> .rtf
          7  .dll -> .html
          7  .kml -> .html
          7  .xls -> .wk1 (Lotus Notes)
  69      6  .doc -> .f
          6  .kmz -> .html
          6  .xml -> .txt
  72      5  . -> .txt
          5  .doc -> .sgml
          5  .docx -> .html
          5  .eps -> .pdf
          5  .exe -> .html
          5  .html -> .rtf
  78      4  .doc -> .ileaf (Interleaf)
          4  .ppt -> .zip
          4  .pptx -> .html
          4  .text -> .doc
          4  .text -> .kml
          4  .xls -> .zip
  84      3  .bmp -> .html
          3  .jpeg -> .html
          3  .ppt -> .sgml
          3  .text -> .wp
          3  .tif -> .html
          3  .xls -> .doc
          3  .xls -> .xml
  91      2  .exported -> .html
          2  .ppt -> .appledouble (AppleDouble encoded Macintosh file)
          2  .ppt -> .odp
          2  .ppt -> .gd
          2  .tmp -> .xml
          2  .xls -> .123
          2  .xls -> .lnk (MS Windows shortcut)
          2  .xls -> .pdf
  99      1  .csv -> .rtf
          1  .doc -> .par
          1  .doc -> .zip
          1  .doc -> .fits
          1  .doc -> .gz
          1  .doc -> .icns
          1  .doc -> .tex
          1  .doc -> .xls
          1  .doc -> .xml
          1  .docx -> .pdf
          1  .hlp -> .html
          1  .hmtl -> .html
          1  .html -> .gif
          1  .html -> .kml
          1  .kml -> .xml
          1  .pdf -> .xml
          1  .ppt -> .pdf
          1  .sql -> .txt
          1  .sys -> .rtf
          1  .wp -> .pdf
          1  .wp -> .rtf
          1  .xls -> .wk3
          1  .xls -> .bin (mc68020 pure executable)
          1  .xls -> .f
          1  .xls -> .sgml
          1  .xml -> .kml
We will be remaking the ZIP files over the next few days and will replace the ZIP files and update the searchable database by 7 December 2010.
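A tally like the rename table above can be produced from a rename script with a few lines of Python. The sketch below assumes, hypothetically, that each rename is a line of the form `mv OLDNAME NEWNAME`; the actual layout of the GOVDOCS1 script may differ, and the file names in the usage comment are made up. Files with no extension are reported as ".", matching the table.

```python
import os
import shlex
from collections import Counter


def tally_renames(lines):
    """Count 'old extension -> new extension' pairs in mv-style lines.

    Assumption: each rename is written as `mv OLDNAME NEWNAME`.
    Blank lines, comments and other commands are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = shlex.split(line, comments=True)
        if len(parts) == 3 and parts[0] == "mv":
            old_ext = os.path.splitext(parts[1])[1] or "."
            new_ext = os.path.splitext(parts[2])[1] or "."
            counts[f"{old_ext} -> {new_ext}"] += 1
    return counts


# Hypothetical usage:
#   with open("rename.sh") as f:
#       for pair, n in tally_renames(f).most_common():
#           print(n, pair)
```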
Forensic Innovations, Inc., analyzes the million file corpus
Forensic Innovations, Inc., makers of File Investigator TOOLS, has performed an analysis of the 986,278 files in the “1 million file corpus”. (13,722 files in the corpus were removed earlier this year because they were from California State Government web servers that were in the .gov domain and were mistakenly collected as part of the original collection effort.)
We would like to thank Forensic Innovations for their work in support of this project. We have made available their summary report and will be making available their file-by-file analysis as soon as we deploy an appropriate database on this website.
Open Source Forensics Conference
ISO 9660 disk images from anti-forensics.ru posted
Our friends at anti-forensics.ru have given us seven very small disk images that are designed to demonstrate failings of particular open source Linux distributions.
You can view all of the images at https://digitalcorpora.org/corp/images/aor/. The images you will find there include:
- 2009-aor-test_caine15.iso
- 2009-aor-test_deft5.iso
- 2009-aor-test_grml200910.iso
- 2009-aor-test_othernew.iso
- 2009-aor-test_otherold.iso
- 2009-aor-test_raptor20091026.iso
- 2009-aor-test_spada4.iso
These images should be directly copied to a hard drive or a partition. Forensic Linux distributions would use them as root file systems and execute proof-of-concept code during the boot.
Details of why these images are useful can be found on the author’s website, at: Linux_for_computer_forensic_investigators_2.pdf