File bulk_extractor-1.3.1.zip contains the source code for bulk_extractor v1.3.1. bulk_extractor is a C++ program that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. bulk_extractor is typically downloaded on a Fedora system and compiled or cross-compiled to Linux, Mac, or Windows using autotools. Please see https://github.com/simsong/bulk_extractor/wiki/Introducing-bulk_extractor.
BEViewer.jar is an executable bulk_extractor viewer user interface.
Bulk Extractor Viewer (BEViewer) provides a graphical user interface for browsing features that have been extracted with the bulk_extractor feature extraction tool. Please see https://github.com/simsong/bulk_extractor/wiki/BEViewer.
be_installer-1.3.exe is a Windows installer for installing bulk_extractor and BEViewer v1.3 on a Windows system.
bulk_extractor.pdf, “Digital media triage with bulk data analysis and bulk-extractor,” discusses how the bulk_extractor tool is effective in providing bulk data analysis.
2012-08-08 bulk_extractor Tutorial.pdf describes how to use the BEViewer tool. Although some of the parameters for running bulk_extractor have changed, the majority of the tutorial remains current.
Source: The information above and links were received from Bruce Allen <bdallen@nps.edu>, Naval Postgraduate School
See other bulk_extractor downloads here: http://digitalcorpora.org/downloads/bulk_extractor/
The file frequent_hashcodes_and_paths_rdc.xml contains SHA1 hashcode and path data
derived from the Real Drive Corpus collected by the DEEP Project at the U.S. Naval
Postgraduate School.
The file provides two kinds of data useful to forensic investigators:
(1) SHA1 hashcodes that occurred for undeleted files on at least five different
drives in the corpus but did not occur in the National Software Reference
Library (http://www.nsrl.nist.gov). These hashcodes likely indicate files that
are uninteresting and can be excluded from most forensic investigations. File
sizes and names are also given.
(2) Path names (file name plus all directories) that occurred for undeleted
files on at least twenty different drives in the corpus. These usefully
supplement the hashcodes in indicating recurring files that are uninteresting
to investigators. However, files at these paths could still be viruses or
other malware, or could be hiding illegal content, although this is unlikely.
Read more … http://digitalcorpora.org/corp/nus-deidentified/README-frequent-hashcodes-and-paths-rdc.txt
Download XML File: http://digitalcorpora.org/corp/nus-deidentified/frequent-hashcodes-and-paths-rdc.xml (102 MB)
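As a rough illustration of how the two kinds of data might be used together, here is a minimal Python sketch. It assumes you have already loaded the hashcodes and path names from the XML file into two sets (the XML layout is not described here, so the loading step is left out), and the function names and labels are hypothetical. A hash match is treated as stronger evidence than a path match alone, since a file at a common path can still have unexpected content.

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def classify_files(paths, known_hashes, known_paths):
    """Label each path as 'known_hash', 'known_path_only', or 'unknown'.

    known_hashes: set of lowercase SHA1 hex strings (assumed already loaded).
    known_paths:  set of path strings (assumed already loaded).
    """
    labels = {}
    for path in paths:
        if sha1_of_file(path) in known_hashes:
            labels[path] = "known_hash"        # content matches a frequent file
        elif path in known_paths:
            labels[path] = "known_path_only"   # common path, content not confirmed
        else:
            labels[path] = "unknown"           # needs a closer look
    return labels
```

Files labeled known_path_only are probably worth a quick check rather than outright exclusion, for the reasons given in (2) above.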
We have created a tar file with 109,282 files from the govdocs1m corpus. You can download it from:
http://digitalcorpora.org/corp/nps/files/govdocs1/files.jpeg.tar
The scenario page for M57-Jean has now been posted.
I have created a new disk image called nps-2010-emails that can be used for testing programs that find email addresses or perform string search.
The disk image consists of 30 different email addresses, each one stored in a different document with a different coding scheme.
Below is a list of the email addresses and their codings:
email address                     Application (Encoding)
plain_text@textedit.com           Apple TextEdit (UTF-8)
plain_text_pdf@textedit.com       Apple TextEdit print-to-PDF (/FlateDecode)
rtf_text@textedit.com             Apple TextEdit (RTF)
rtf_text_pdf@textedit.com         Apple TextEdit print-to-PDF (/FlateDecode)
plain_utf16@textedit.com          Apple TextEdit (UTF-16)
plain_utf16_pdf@textedit.com      Apple TextEdit print-to-PDF (/FlateDecode)
pages@iwork09.com                 Apple Pages '09
pages_comment@iwork09.com         Apple Pages (comment) '09
keynote@iwork09.com               Apple Keynote '09
keynote_comment@iwork09.com       Apple Keynote '09 (comment)
numbers@iwork09.com               Apple Numbers '09
numbers_comment@iwork09.com       Apple Numbers '09 (comment)
user_doc@microsoftword.com        Microsoft Word 2008 (Mac) (.doc file)
user_doc_pdf@microsoftword.com    Microsoft Word 2008 (Mac) print-to-PDF
user_docx@microsoftword.com
user_docx_pdf@microsoftword.com   Microsoft Word 2008 (Mac) print-to-PDF (.docx file)
xls_cell@microsoft_excel.com
xls_comment@microsoft_excel.com   Microsoft Word 2008 (Mac)
xlsx_cell@microsoft_excel.com     Microsoft Word 2008 (Mac)
xlsx_comment@microsoft_excel.com  Microsoft Word 2008 (Mac) (Comment)
doc_within_doc@document.com       Microsoft Word 2007 (OLE .doc file within .doc)
docx_within_docx@document.com     Microsoft Word 2007 (OLE .doc file within .doc)
ppt_within_doc@document.com       Microsoft PowerPoint and Word 2007 (OLE .ppt file within .doc)
pptx_within_docx@document.com     Microsoft PowerPoint and Word 2007 (OLE .pptx file within .docx)
xls_within_doc@document.com       Microsoft Excel and Word 2007 (OLE .xls file within .doc)
xlsx_within_docx@document.com     Microsoft Excel and Word 2007 (OLE .xlsx file within .docx)
email_in_zip@zipfile1.com         text file within ZIP
email_in_zip_zip@zipfile2.com     ZIP'ed text file, ZIP'ed
email_in_gzip@gzipfile.com        text file within GZIP
email_in_gzip_gzip@gzipfile.com   GZIP'ed text file, GZIP'ed
The image can be downloaded from http://digitalcorpora.org/corp/nps/drives/nps-2010-emails/
Edit, 2011-11-26 19:32 PST: One email was incorrectly recorded above. xlsx_comment@microsoft_excel.com is within the disk image, but xlsx_cell_comment@microsoft_excel.com was recorded here. That is now corrected above.
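To show why the different codings matter, here is a small Python sketch of a naive raw-byte scanner. It looks for addresses stored as plain ASCII/UTF-8 and as UTF-16LE (each character followed by a zero byte), and nothing else. The address pattern is a deliberately simplified assumption, not a full RFC 5322 grammar, and the function name is made up for this example.

```python
import re

# Simplified character classes (assumption). The underscore is allowed in
# the domain because the test addresses use microsoft_excel.com.
ASCII_EMAIL = re.compile(
    rb"[A-Za-z0-9._%+\-]+@[A-Za-z0-9._\-]+\.[A-Za-z]{2,}"
)

# Same pattern, but every character is followed by a NUL byte (UTF-16LE).
UTF16_EMAIL = re.compile(
    rb"(?:[A-Za-z0-9._%+\-]\x00)+@\x00"
    rb"(?:[A-Za-z0-9._\-]\x00)+\.\x00(?:[A-Za-z]\x00){2,}"
)

def find_emails(data):
    """Return the set of addresses visible in raw bytes without decoding
    any container format (no ZIP, GZIP, or /FlateDecode handling)."""
    found = {m.group().decode("ascii") for m in ASCII_EMAIL.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in UTF16_EMAIL.finditer(data)}
    return found
```

A scanner like this should see the plain-text and UTF-16 addresses but not those inside compressed streams, which is the kind of gap the image is meant to expose.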
I have posted a text file containing MD5 hashes for the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus. This file is intended for research on sector hashing. You can download the file from http://digitalcorpora.org/corp/nps/files/govdocs1/govdocs1-first512-first4096-docid.txt
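If you want to compute comparable hashes for your own files, something along these lines should work. This is only a sketch: the function name is hypothetical, and how the published file treats files shorter than 512 or 4096 bytes is not stated here, so this version simply hashes whatever bytes are available.

```python
import hashlib

def prefix_md5s(path, sizes=(512, 4096)):
    """Return {size: MD5 hex digest} of the first `size` bytes of a file.

    Assumption: a file shorter than `size` is hashed over its full length.
    """
    with open(path, "rb") as f:
        head = f.read(max(sizes))
    return {n: hashlib.md5(head[:n]).hexdigest() for n in sizes}
```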
December 27th, 2010
admin
I’m preparing some statistics on who (and what) is downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.
Here are the bots that we’ve found, and the number of downloads made by each one.
Rank Count Value(s):
============================
1 2334 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2 851 MLBot (www.metadatalabs.com/mlbot)
3 811 SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
4 749 Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
5 492 Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
6 130 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
7 115 Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/)
8 109 msnbot/2.0b (+http://search.msn.com/msnbot.htm)
9 108 Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
10 89 CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
11 87 Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
12 78 TwengaBot-Discover (http://www.twenga.fr/bot-discover.html)
13 58 Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
14 51 msnbot/1.1 (+http://search.msn.com/msnbot.htm)
15 26 Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
16 21 Cityreview Robot (+http://www.cityreview.org/crawler/)
17 18 'citeseerxbot'
18 15 SindiceBot (heritrix/2.0.2 +http://sindice.com/developers/bot)
19 12 Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
20 11 Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
21 9 Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
22 7 CatchBot/3.0; +http://www.catchbot.com
7 CyberPatrol SiteCat Webbot (http://www.cyberpatrol.com/cyberpatrolcrawler.asp)
7 yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
25 6 Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php)
6 yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
27 5 MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
5 yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) http://yacy.net/bot.html
5 yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) http://yacy.net/bot.html
30 3 msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
31 2 Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9
2 yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
2 yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) http://yacy.net/bot.html
2 yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) http://yacy.net/bot.html
35 1 Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)
1 Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
1 findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
1 librabot/1.0 (+http://search.msn.com/msnbot.htm)
1 yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) http://yacy.net/bot.html
1 yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) http://yacy.net/bot.html
1 yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) http://yacy.net/bot.html
1 yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
1 yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) http://yacy.net/bot.html
Total items printed: 6242
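For anyone who wants to do something similar with their own logs, here is a minimal Python sketch of one way to separate bots from other clients by User-Agent string. It is not the method used to produce the table above; the marker list and function names are assumptions, and a substring test like this can misclassify clients.

```python
from collections import Counter

# Substrings that suggest an automated client (assumption, not exhaustive).
BOT_MARKERS = ("bot", "crawler", "spider")

def is_probable_bot(user_agent):
    """Heuristic: case-insensitive substring match against BOT_MARKERS."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

def tally_downloads(user_agents):
    """Return (bot_counts, other_counts) as Counters keyed by User-Agent."""
    bots, others = Counter(), Counter()
    for ua in user_agents:
        (bots if is_probable_bot(ua) else others)[ua] += 1
    return bots, others
```

Calling bots.most_common() on the first Counter gives a ranked list similar in shape to the one above.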
December 10th, 2010
admin
The M57-Patents scenario is now available. This scenario includes nearly a terabyte of information with 50 disk images, memory dumps, and network packets. There are three specific crimes in the scenario that can be solved, but there are also collections of data that can be used to enable a variety of computer forensics research projects and tool development.
The scenario is split up into many pieces so you can download just what you need.
You can download it from
As an artifact of the way that it was collected, many of the extensions for the files in the NPS GOVDOCS1 corpus did not reflect the type of the underlying file. For example, many files that were labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases file extensions chosen when the document was created no longer match current usage, as was the case with several files that had a ‘.doc’ extension but were actually WordPerfect files.
We have gone through the corpus and created a shell script that renames the files to current usage. The script contains 115,135 lines. Of these, the following renames are implemented:
Rank Count Value(s):
============================
1 77227 .text -> .txt
2 9290 .xml -> .html
3 3683 .pdf -> .html
4 3565 . -> .html
5 2602 . -> .unk
6 2601 .xls -> .dbase3
7 2082 .text -> .unk
8 1943 . -> .pdf
9 1942 .text -> .html
10 1857 .doc -> .html
11 1088 .doc -> .rtf
12 620 .xls -> .html
13 595 .text -> .f
14 533 .text -> .xml
15 459 .ppt -> .html
16 438 .xls -> .txt
17 435 .doc -> .txt
18 346 .doc -> .wp
19 283 .txt -> .html
20 269 .eps -> .html
21 256 .log -> .html
22 253 .doc -> .unk
23 228 .swf -> .html
24 218 .xls -> .unk
25 179 .text -> .fits
26 175 .dwf -> .html
27 166 .gz -> .html
28 163 .sql -> .html
29 161 .text -> .tex
30 155 .html -> .xml
31 107 .html -> .pdf
32 96 .text -> .troff
33 94 .ps -> .html
34 70 .js -> .html
35 66 . -> .xml
36 60 .xls -> .gls
37 59 .ttf -> .txt
38 53 .text -> .sgml
39 45 .jpg -> .html
40 36 .ppt -> .txt
41 35 .csv -> .html
35 .ttf -> .html
43 30 .ppt -> .unk
44 29 .text -> .pdf
29 .xbm -> .txt
46 26 .java -> .html
26 .zip -> .html
48 25 .doc -> .fm
49 22 .text -> .rtf
50 21 .pub -> .html
51 20 .js -> .txt
52 17 .jar -> .html
17 .jar -> .txt
17 .text -> .gz
55 16 .ps -> .pdf
56 15 .ppt -> .doc
57 14 .text -> .swf
14 .tmp -> .html
14 .xbm -> .html
60 13 .doc -> .pdf
13 .doc -> .troff
62 9 .pps -> .html
9 .xlsx -> .html
64 8 .log -> .txt
65 7 . -> .rtf
7 .dll -> .html
7 .kml -> .html
7 .xls -> .wk1 (Lotus 1-2-3)
69 6 .doc -> .f
6 .kmz -> .html
6 .xml -> .txt
72 5 . -> .txt
5 .doc -> .sgml
5 .docx -> .html
5 .eps -> .pdf
5 .exe -> .html
5 .html -> .rtf
78 4 .doc -> .ileaf (Interleaf)
4 .ppt -> .zip
4 .pptx -> .html
4 .text -> .doc
4 .text -> .kml
4 .xls -> .zip
84 3 .bmp -> .html
3 .jpeg -> .html
3 .ppt -> .sgml
3 .text -> .wp
3 .tif -> .html
3 .xls -> .doc
3 .xls -> .xml
91 2 .exported -> .html
2 .ppt -> .appledouble (AppleDouble encoded Macintosh file)
2 .ppt -> .odp
2 .ppt -> .gd
2 .tmp -> .xml
2 .xls -> .123
2 .xls -> .lnk (MS Windows shortcut)
2 .xls -> .pdf
99 1 .csv -> .rtf
1 .doc -> .par
1 .doc -> .zip
1 .doc -> .fits
1 .doc -> .gz
1 .doc -> .icns
1 .doc -> .tex
1 .doc -> .xls
1 .doc -> .xml
1 .docx -> .pdf
1 .hlp -> .html
1 .hmtl -> .html
1 .html -> .gif
1 .html -> .kml
1 .kml -> .xml
1 .pdf -> .xml
1 .ppt -> .pdf
1 .sql -> .txt
1 .sys -> .rtf
1 .wp -> .pdf
1 .wp -> .rtf
1 .xls -> .wk3
1 .xls -> .bin (mc68020 pure executable)
1 .xls -> .f
1 .xls -> .sgml
1 .xml -> .kml
You can download the script to perform the fixes from: http://domex.nps.edu/corp/files/govdocs1/fixgovdocs1.zip
We will be remaking the ZIP files over the next few days and will replace the ZIP files and update the searchable database by 7 December 2010.
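To give a feel for the kind of check behind renames such as .xls -> .html, here is a small Python sketch that looks at the first bytes of a file and proposes a new name when the content does not match the extension. It is not the procedure used to build the script above. The signature list is a tiny, assumed subset, the function names are hypothetical, and container formats are left out on purpose.

```python
from pathlib import Path

# A few well-known leading byte signatures (deliberately incomplete).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"{\\rtf", ".rtf"),
    (b"\x1f\x8b", ".gz"),
]

def sniff_extension(head):
    """Guess an extension from the first bytes of a file, or return None."""
    for magic, ext in SIGNATURES:
        if head.startswith(magic):
            return ext
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return ".html"   # e.g. an HTML error page saved under another name
    return None          # unrecognized: make no claim about the type

def propose_rename(path):
    """Return a Path with the sniffed extension, or None if no change."""
    path = Path(path)
    with open(path, "rb") as f:
        ext = sniff_extension(f.read(512))
    if ext is None or ext == path.suffix.lower():
        return None
    return path.with_suffix(ext)
```

This only proposes names; doing the actual rename, and handling the many formats in the table above, would take considerably more work.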