The following post is now obsolete. The file frequent_hashcodes_and_paths_rdc.xml has been removed from the corpus as explained in a more recent post. Please see:
Deprecated Post from Mar 29, 2013 @ 13:13
The file frequent_hashcodes_and_paths_rdc.xml contains SHA1 hashcode and path data derived from the Real Drive Corpus collected by the DEEP Project at the U.S. Naval Postgraduate School. The file provides two kinds of data useful to forensic investigators: (1) SHA1 hashcodes that occurred for undeleted files on at least five different drives in the corpus but did not occur in the National Software Reference Library (http://www.nsrl.nist.gov). These hashcodes likely indicate files that are uninteresting and can be excluded in most forensic investigations. File sizes and names are also given. (2) Path names (file name plus all directories) that occurred for undeleted files on at least twenty different drives in the corpus. These supplement the hashcodes in identifying recurring files of little interest to investigators. However, such recurring files could still include viruses and other malware, or could be hiding illegal content, although this is unlikely.
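As a rough illustration of how such lists might be applied, the Python sketch below splits a file listing into files to examine and files that match a frequent hashcode or a frequent path. It is only a sketch under stated assumptions: the function names, the (path, SHA1) input format, and the exact-match comparison of paths are hypothetical, and it does not parse the XML file, whose layout is not described here. Matched files are set aside rather than discarded, since they could still be malware or hide illegal content.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_by_frequent_lists(files, frequent_hashes, frequent_paths):
    """Split (path, sha1_hex) pairs into (keep, excluded).

    Hypothetical helper: frequent_hashes and frequent_paths are assumed
    to have been loaded already from a list of frequent hashcodes and
    paths. Paths are compared as exact strings (an assumption); hex
    digests are compared case-insensitively.
    """
    hashes = {h.lower() for h in frequent_hashes}
    paths = set(frequent_paths)
    keep, excluded = [], []
    for path, sha1 in files:
        if sha1.lower() in hashes or path in paths:
            # Likely uninteresting, but kept for possible later review.
            excluded.append((path, sha1))
        else:
            keep.append((path, sha1))
    return keep, excluded
```

A caller would typically build the (path, SHA1) pairs with sha1_of_file over a mounted drive image and then review only the keep list first.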
Read more … http://digitalcorpora.org/corp/hashes/nus-deidentified/README-frequent-hashcodes-and-paths-rdc.txt
Download XML File (HAS BEEN REMOVED): http://digitalcorpora.org/corp/hashes/nus-deidentified/frequent-hashcodes-and-paths-rdc.xml (102 MB)