This file lists SHA-1 hash values of files judged uninteresting for
forensic investigations by a variety of criteria, including frequency
on drives of both hash value and path, time of creation within both
the minute and the week, file size, directory context both in path and
in sibling files, and file extension. Hash values are listed as 40
hexadecimal characters. This data is derived from the Real Drive
Corpus collected by the DEEP Project at the U.S. Naval Postgraduate
School, plus data from drives in classrooms and laboratories at NPS
and some other sources. Hash values in the January 2014
version of NSRL (the National Software Reference Library, nist.gov)
have been excluded.
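
As an illustration of how such a list might be used, the following Python sketch loads the hash values into a set and tests whether a file's SHA-1 appears in it. It assumes the list stores one 40-character hexadecimal value per line, which this README does not state, and the function names load_hash_set, sha1_of_file, and is_uninteresting are hypothetical. It is a minimal sketch, not the method used to build the list.

```python
import hashlib
import re

# A SHA-1 value written as exactly 40 hexadecimal characters.
SHA1_HEX = re.compile(r"[0-9a-fA-F]{40}")


def load_hash_set(lines):
    """Collect SHA-1 values from an iterable of text lines.

    Assumption: one hash value per line. Lines that are not exactly
    40 hexadecimal characters (after stripping whitespace) are skipped.
    """
    hashes = set()
    for line in lines:
        token = line.strip()
        if SHA1_HEX.fullmatch(token):
            hashes.add(token.upper())  # normalize case for comparison
    return hashes


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 of a file as 40 uppercase hexadecimal characters."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def is_uninteresting(path, hash_set):
    """True if the file's SHA-1 is in the set of listed hash values."""
    return sha1_of_file(path) in hash_set
```

A caller would typically open the hash list as a text file, pass the file object to load_hash_set, and then call is_uninteresting for each file found on a drive image.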
The criteria for selecting these hash values and the methods used to
obtain them are described in
http://faculty.nps.edu/ncrowe/uninteresting.htm but have now been
applied to significantly more files than the corpus used for the
paper. Our methods focus on cross-correlation of files in a large
corpus and are thus quite different from those used in collecting the
NSRL data. The hash values were obtained from images of 245 million
files on 3905 drives. Our set currently has 16 million hash values not
in NSRL, and NSRL currently has 36 million hash values, so this is a
significant supplement to NSRL.
This data was produced in July 2014 by Neil Rowe, email@example.com.
Please acknowledge us in publications if you use this data.
The text file govdocs1-first512-first4096-docid.txt, which contained MD5 hashes of the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus, has been removed. This file was provided to assist research on block hashes. We have since created the hashdb toolset, which supports creating and working with databases of block hashes. Please refer to https://github.com/simsong/hashdb/wiki for the code, continuing progress on this topic, and links to relevant papers, including:
Distinct Sector Hashes for Target File Detection
A related master's thesis on this topic was completed at the Naval Postgraduate School in 2012 and can be downloaded for additional reading: http://simson.net/ref/2012/kmf_thesis.pdf
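
For readers who want to reproduce prefix hashes of the kind the removed file contained, the following Python sketch computes the MD5 of the first 512 bytes and the first 4096 bytes of a file. The function name prefix_md5s is hypothetical, and the handling of files shorter than the prefix length (hashing whatever bytes exist) is an assumption; how the original file treated such cases is not stated here. This sketch does not use hashdb and is not a substitute for it.

```python
import hashlib


def prefix_md5s(path, sizes=(512, 4096)):
    """Return {size: MD5 hex digest of the first `size` bytes of the file}.

    Assumption: if the file is shorter than `size`, the digest covers
    only the bytes that are present.
    """
    with open(path, "rb") as handle:
        # Read once, up to the largest prefix requested.
        head = handle.read(max(sizes))
    return {size: hashlib.md5(head[:size]).hexdigest() for size in sizes}
```

Calling prefix_md5s on a file returns a dictionary keyed by 512 and 4096, each mapped to a 32-character hexadecimal MD5 digest.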
File bulk_extractor-1.3.1.zip contains the source code for bulk_extractor v1.3.1. bulk_extractor is a C++ program that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. bulk_extractor is typically downloaded on a Fedora system and compiled or cross-compiled for Linux, Mac, or Windows using autotools. Please see https://github.com/simsong/bulk_extractor/wiki/Introducing-bulk_extractor.
BEViewer.jar is an executable bulk_extractor viewer user interface.
Bulk Extractor Viewer (BEViewer) provides a graphical user interface for browsing features that have been extracted by the bulk_extractor feature extraction tool. Please see https://github.com/simsong/bulk_extractor/wiki/BEViewer.
be_installer-1.3.exe is a Windows installer for bulk_extractor and BEViewer v1.3.
bulk_extractor.pdf, “Digital media triage with bulk data analysis and bulk-extractor,” discusses how the bulk_extractor tool is effective in providing bulk data analysis.
2012-08-08 bulk_extractor Tutorial.pdf describes how to use the BEViewer tool. Although some of the parameters for running bulk_extractor have changed, the majority of the tutorial remains current.
Source: The information above and links were received from Bruce Allen <firstname.lastname@example.org>, Naval Postgraduate School
See other bulk_extractor downloads here: http://digitalcorpora.org/downloads/bulk_extractor/
The following post is now obsolete. The file frequent_hashcodes_and_paths_rdc.xml has been removed from the corpus as explained in a more recent post. Please see:
Deprecated Post from Mar 29, 2013 @ 13:13
The file frequent_hashcodes_and_paths_rdc.xml contains SHA1 hashcode and path data derived from the Real Drive Corpus collected by the DEEP Project at the U.S. Naval Postgraduate School. The file provides two kinds of data useful to forensic investigators: (1) SHA1 hashcodes that occurred for undeleted files on at least five different drives in the corpus but did not occur in the National Software Reference Library (http://www.nsrl.nist.gov). These are likely to indicate files that are uninteresting and excludable in most forensic investigations. File sizes and names are also given. (2) Path names (file name plus all directories) that occurred for undeleted files on at least twenty different drives in the corpus. These usefully supplement the hashcodes in indicating recurring files that are uninteresting for investigators. However, occurrences of these files could include viruses and other malware, or could be hiding illegal content, although this is unlikely.
Read more … http://digitalcorpora.org/corp/nus-deidentified/README-frequent-hashcodes-and-paths-rdc.txt
Download XML File (HAS BEEN REMOVED): http://digitalcorpora.org/corp/nus-deidentified/frequent-hashcodes-and-paths-rdc.xml (102 MB)
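
The selection rule described in the deprecated post above (hashcodes seen on at least five drives and absent from NSRL, and paths seen on at least twenty drives) can be sketched in Python as follows. The record layout (drive_id, sha1, path), the function name frequent_values, and the assumption that the input already contains only undeleted files are all hypothetical; the sketch illustrates the thresholds only and is not the code that produced the XML file.

```python
from collections import defaultdict


def frequent_values(records, nsrl_hashes, min_hash_drives=5, min_path_drives=20):
    """Select frequently occurring hashcodes and paths.

    records: iterable of (drive_id, sha1, path) tuples, assumed to
             describe undeleted files only.
    nsrl_hashes: set of hashcodes already present in NSRL.
    Returns (frequent_hashes, frequent_paths) as two sets.
    """
    hash_drives = defaultdict(set)  # hashcode -> distinct drives it appears on
    path_drives = defaultdict(set)  # path -> distinct drives it appears on
    for drive_id, sha1, path in records:
        hash_drives[sha1].add(drive_id)
        path_drives[path].add(drive_id)

    frequent_hashes = {
        sha1
        for sha1, drives in hash_drives.items()
        if len(drives) >= min_hash_drives and sha1 not in nsrl_hashes
    }
    frequent_paths = {
        path
        for path, drives in path_drives.items()
        if len(drives) >= min_path_drives
    }
    return frequent_hashes, frequent_paths
```

Counting distinct drives rather than total occurrences means a file copied many times on a single drive does not qualify by itself.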