Announcing GOVDOCS1.1

December 4th, 2010

As an artifact of the way that it was collected, many of the extensions for the files in the NPS GOVDOCS1 corpus did not reflect the type of the underlying file. For example, many files that were labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases, file extensions chosen when the document was created no longer match current usage, as was the case with several files that had a ‘.doc’ extension but were actually WordPerfect files.
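The kind of mismatch described above can be detected by looking at the first bytes of a file rather than its name. The following is a minimal sketch, not the procedure actually used to build the rename script: the function names and the small `EXPECTED` table are assumptions made for illustration, and only a handful of well-known file signatures are checked.

```python
import os

# Extension -> coarse content type we would expect to sniff (illustrative subset).
EXPECTED = {
    ".xls": "ole2", ".doc": "ole2", ".ppt": "ole2",
    ".pdf": "pdf", ".html": "html", ".zip": "zip", ".gz": "gzip", ".rtf": "rtf",
}


def sniff_type(head: bytes) -> str:
    """Guess a coarse content type from the first bytes of a file."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"  # OLE2 container used by legacy .doc/.xls/.ppt files
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    text = head[:1024].lstrip().lower()
    if text.startswith(b"<!doctype html") or b"<html" in text:
        return "html"
    return "unk"


def looks_mislabeled(filename: str, head: bytes) -> bool:
    """True if the extension is in EXPECTED and the sniffed type disagrees."""
    ext = os.path.splitext(filename)[1].lower()
    expected = EXPECTED.get(ext)
    return expected is not None and sniff_type(head) != expected
```

With this sketch, a file named `000123.xls` whose contents begin with an HTML error page would be flagged, while a genuine OLE2 spreadsheet would not.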

We have gone through the corpus and created a shell script that renames the files to current usage. The script contains 115,135 lines. Of these, the following renames are implemented:

  Rank     Count     Value(s):
      1     77227      .text -> .txt  
      2      9290      .xml -> .html  
      3      3683      .pdf -> .html  
      4      3565      . -> .html  
      5      2602      . -> .unk  
      6      2601      .xls -> .dbase3  
      7      2082      .text -> .unk  
      8      1943      . -> .pdf  
      9      1942      .text -> .html  
     10      1857      .doc -> .html  
     11      1088      .doc -> .rtf  
     12       620      .xls -> .html  
     13       595      .text -> .f  
     14       533      .text -> .xml  
     15       459      .ppt -> .html  
     16       438      .xls -> .txt  
     17       435      .doc -> .txt  
     18       346      .doc -> .wp  
     19       283      .txt -> .html  
     20       269      .eps -> .html  
     21       256      .log -> .html  
     22       253      .doc -> .unk  
     23       228      .swf -> .html  
     24       218      .xls -> .unk  
     25       179      .text -> .fits  
     26       175      .dwf -> .html  
     27       166      .gz -> .html  
     28       163      .sql -> .html  
     29       161      .text -> .tex  
     30       155      .html -> .xml  
     31       107      .html -> .pdf  
     32        96      .text -> .troff  
     33        94      .ps -> .html  
     34        70      .js -> .html  
     35        66      . -> .xml  
     36        60      .xls -> .gls  
     37        59      .ttf -> .txt  
     38        53      .text -> .sgml  
     39        45      .jpg -> .html  
     40        36      .ppt -> .txt  
     41        35      .csv -> .html  
               35      .ttf -> .html  
     43        30      .ppt -> .unk  
     44        29      .text -> .pdf  
               29      .xbm -> .txt  
     46        26      .java -> .html  
               26      .zip -> .html  
     48        25      .doc -> .fm  
     49        22      .text -> .rtf  
     50        21      .pub -> .html  
     51        20      .js -> .txt  
     52        17      .jar -> .html  
               17      .jar -> .txt  
               17      .text -> .gz  
     55        16      .ps -> .pdf  
     56        15      .ppt -> .doc  
     57        14      .text -> .swf  
               14      .tmp -> .html  
               14      .xbm -> .html  
     60        13      .doc -> .pdf  
               13      .doc -> .troff  
     62         9      .pps -> .html  
                9      .xlsx -> .html  
     64         8      .log -> .txt  
     65         7      . -> .rtf  
                7      .dll -> .html  
                7      .kml -> .html  
                 7      .xls -> .wk1  (Lotus 1-2-3)  
     69         6      .doc -> .f  
                6      .kmz -> .html  
                6      .xml -> .txt  
     72         5      . -> .txt  
                5      .doc -> .sgml  
                5      .docx -> .html  
                5      .eps -> .pdf  
                5      .exe -> .html  
                5      .html -> .rtf  
     78         4      .doc -> .ileaf  (Interleaf)  
                4      .ppt -> .zip  
                4      .pptx -> .html  
                4      .text -> .doc  
                4      .text -> .kml  
                4      .xls -> .zip  
     84         3      .bmp -> .html  
                3      .jpeg -> .html  
                3      .ppt -> .sgml  
                3      .text -> .wp  
                3      .tif -> .html  
                3      .xls -> .doc  
                3      .xls -> .xml  
     91         2      .exported -> .html  
                 2      .ppt -> .appledouble  (AppleDouble encoded Macintosh file)
                2      .ppt -> .odp
                2      .ppt -> .gd
                2      .tmp -> .xml  
                2      .xls -> .123
                 2      .xls -> .lnk  (MS Windows shortcut)
                2      .xls -> .pdf  
     99         1      .csv -> .rtf  
                1      .doc -> .par 
                1      .doc -> .zip
                1      .doc -> .fits  
                1      .doc -> .gz  
                1      .doc -> .icns  
                1      .doc -> .tex  
                1      .doc -> .xls  
                1      .doc -> .xml  
                1      .docx -> .pdf  
                1      .hlp -> .html  
                1      .hmtl -> .html  
                1      .html -> .gif  
                1      .html -> .kml  
                1      .kml -> .xml  
                1      .pdf -> .xml  
                1      .ppt -> .pdf  
                1      .sql -> .txt  
                1      .sys -> .rtf  
                1      .wp -> .pdf  
                1      .wp -> .rtf  
                1      .xls -> .wk3
                 1      .xls -> .bin  (mc68020 pure executable)
                1      .xls -> .f  
                1      .xls -> .sgml  
                1      .xml -> .kml  
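
A tally like the one above could be reproduced from the rename script with a few lines of Python. This is only a sketch: it assumes, hypothetically, that each rename in the script is a plain `mv OLD NEW` line, which may not match the actual script, and the function name `tally_renames` is made up for illustration.

```python
import os
import shlex
from collections import Counter


def tally_renames(script_text: str) -> Counter:
    """Count 'old extension -> new extension' pairs in a script of mv commands.

    Assumes (hypothetically) one 'mv OLD NEW' command per line; any other
    line, including comments and blank lines, is skipped.
    """
    counts = Counter()
    for line in script_text.splitlines():
        parts = shlex.split(line, comments=True)
        if len(parts) != 3 or parts[0] != "mv":
            continue
        # A file with no extension is reported as "." to match the table.
        old_ext = os.path.splitext(parts[1])[1] or "."
        new_ext = os.path.splitext(parts[2])[1] or "."
        counts[f"{old_ext} -> {new_ext}"] += 1
    return counts
```

Calling `tally_renames(text).most_common()` returns the pairs sorted by descending count, which is the order used in the table.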

We will be remaking the ZIP files over the next few days and will replace the ZIP files and update the searchable database by 7 December 2010.


Real Data Corpus FAQ

March 7th, 2010

The Real Data Corpus

The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.

Potential Uses

The Real Data Corpus is a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery algorithms and tools.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

Current Contents

  • A total of 156 hard drive images ranging in size from 500MB to 80GB.
  • Approximately 600 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
  • Approximately 100 CDs, all purchased outside the US.
  • Approximately 10 digital camera memory images.
  • Approximately 40 GSM SIM chip memory images.

More details of the corpus content can be found in Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada.[1]

IRB Approval Required for “Research”

The National Research Act (NRA) of 1974[2] and the Common Rule[3] govern all federally funded research in the United States that is performed with human beings as experimental subjects. Because portions of the Real Data Corpus were funded by the US Government, this legal framework must be followed in research involving the Real Data Corpus.

The Common Rule creates a four-part test that determines whether a proposed activity must be reviewed by an Institutional Review Board (IRB). Specifically, IRB approval is required if:

  1. The activity constitutes scientific “research,” a term that the Common Rule broadly defines as “a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge.”[4]
  2. The research is federally funded.[5]
  3. The research involves human subjects, which the Common Rule defines as “a living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.”[6]
  4. The research is not “exempt” under the regulations.[7] The Common Rule exempts research involving “existing data, documents, [and] records…” provided that the data set is either “publicly available” or that the subjects “cannot be identified, directly or through identifiers linked to the subjects” (§46.101(b)(4)).

Research involving the Real Data Corpus is not exempt under the Common Rule because the RDC is not publicly available and in many cases it is possible to identify individuals whose data are in the collection. Furthermore, the majority of the subjects included in the Real Data Corpus have not provided consent to have their data used for research.

Several mitigating factors allow the use of this data: the data was lawfully obtained; research involving this data is “minimal risk” (provided that the data is properly protected and personally identifiable information inside the RDC is kept confidential); there is substantial public benefit in using the RDC for research into computer forensics and computer security; and there is no practical alternative to using this data.

Even if research involving the RDC were exempt, most US universities do not allow experimenters to make their own determination of exemption. Instead, these institutions require that the experimenter submit an application for exempt research to the IRB.

To date no IRB has blocked the approval of research that involves the RDC.

To submit an application to an IRB, all experimenters who will make use of the human subject data must take the appropriate human subject training prescribed by their institution. Most institutions prohibit students from filing applications directly, and instead require that an application be filed by a researcher or professor who can be considered a “principal investigator” for external funding.

As a result, any proposed use of the RDC in research requires that an IRB application be filed with the host institution and with the Naval Postgraduate School. A copy of both the application and the approval from both the host institution and NPS must be provided prior to access being granted. The application must clearly state:

  • The proposed research that is to be done.
  • Why it is necessary to use the RDC; why simulated or realistic data cannot be used as an alternative.
  • What measures will be used to protect the data in the RDC.
  • What measures will be used to prevent the publication of personally identifiable information in any research products.

Please provide us with your IRB application prior to submitting it to your IRB! We can review the application and let you know if it is consistent with the IRB approval that we have already obtained, or if we will need to apply for additional IRB approval.

Sample applications are available upon request.

Alternatives to IRB Approval

If you are interested in working with realistic disk images and do not wish to obtain IRB approval, you may be interested in working with the NPS Realistic Corpora. These are actual disk images of working systems, but the data on the disks was created by investigators according to scripts: the images do not contain identifiable information from actual persons. These images can be downloaded without prior approval.

Access and Availability

The Real Data Corpus can be distributed to sponsors and collaborators as a set of encrypted AFF files. The files are encrypted with AES-256 using AFF encryption, keyed by either a passphrase or X.509 PKI. They can be delivered in several ways:

  • Disk images can be downloaded over the Internet from a secure server using SSL by authorized researchers.
  • Alternatively, we can package the files onto portable terabyte USB hard drives or optical tape.
  • Researchers can be given an account on a multi-user Linux computer on which all of the corpora reside.
  • Finally, we have developed a remote exploitation framework: we publish XML files of each drive’s metadata; you select which sectors you need and download them over the Internet using our XMLRPC framework.

Individual files can be accessed over the Internet from our secure server using a remote exploitation framework based on XMLRPC that we have devised:

  • Collaborators and sponsors are given a username and password which allows access to our subversion source code repository, our research wiki, and the web-based catalog of disk images.
  • Each disk image has been processed using fiwalk, a file system walking program that uses the Sleuth Kit API. The fiwalk program creates an XML data structure for each disk that includes the partitions, the resident files, deleted files, and orphan files. Each file is listed by its file name (if available), its MAC times, MD5, SHA1, extractable metadata, and a unique ID. These XML files can be downloaded using HTTP.
  • A web service using XMLRPC takes the unique IDs and returns the bytes associated with the matching file.
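
The workflow in the list above (download the XML, pick file IDs, fetch the bytes, check them against the recorded hash) might look roughly like the following sketch. The element names `fileobject`, `filename`, `id`, and `hashdigest` are assumptions loosely modeled on fiwalk-style output and may differ from the real files; the sketch also assumes the XML uses no namespace. The `fetch` argument is a hypothetical stand-in for whatever XMLRPC call the server actually exposes, which is not documented here.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional


def index_files(xml_text: str) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each file's unique ID to its file name and MD5 (element names assumed)."""
    root = ET.fromstring(xml_text)
    index: Dict[str, Dict[str, Optional[str]]] = {}
    for obj in root.iter("fileobject"):
        file_id = obj.findtext("id")
        if file_id is None:
            continue  # without an ID the file cannot be requested
        digests = {
            h.get("type", "").lower(): (h.text or "").strip().lower()
            for h in obj.findall("hashdigest")
        }
        index[file_id] = {
            "filename": obj.findtext("filename"),
            "md5": digests.get("md5"),
        }
    return index


def fetch_and_verify(fetch: Callable[[str], bytes], file_id: str,
                     index: Dict[str, Dict[str, Optional[str]]]) -> bytes:
    """Fetch a file's bytes via `fetch` and compare them to the indexed MD5."""
    data = fetch(file_id)
    expected = index[file_id]["md5"]
    if expected is not None and hashlib.md5(data).hexdigest() != expected:
        raise ValueError(f"MD5 mismatch for file id {file_id}")
    return data
```

Passing the retrieval step in as a callable keeps the hash check independent of the transport, so the same code can be exercised against a local stub before it is pointed at a real service.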

Ownership and Legal Status

The Non-US Real Data Corpus consists of images from hard drives, flash memory, and small devices purchased using money provided under NSF Award 0730389 and with other governmental funds. These images are available to qualified researchers who agree to the terms of the Institutional Review Board application under which the data was collected.

Because the RDC was purchased on the secondary market, use in the United States is governed by the “First Sale” doctrine and by the US Supreme Court’s ruling in California v. Greenwood (486 US 35). Essentially, when the data-carrying devices were sold and/or discarded, all privacy rights to the data in those devices were forfeited.

Simson Garfinkel asserts a compilation copyright for the two Garfinkel corpora.

Privacy Issues

Because the media from which the Real Data Corpus was extracted were lawfully purchased on the secondary market, the original data custodians have legally forfeited any privacy rights that previously attached to the data.

From a moral perspective, however, the information in this corpus must be treated with respect and processed using strong computer security measures. That is because the Real Data Corpus literally contains “real data from real people.” Many of the data subjects did not knowingly release the information in the corpus: the data subjects may have tried but failed to erase the contents of the media before it was sold on the secondary market. Alternatively, the data may have been released not by the subject, but by a data custodian such as a business or consultant. For these reasons it is our practice to treat this information as privacy-sensitive data, even though legally it is not.


[2] PL 93-348

[3] 45 CFR 46

[4] §46.102(d)

[5] §46.103(a)

[6] §46.102(f)

[7] §46.101(b)


Disk Images

June 2nd, 2009

We have many sources of disk images available for use in education and research. The easiest disk images to work with are the NPS Test Disk Images. We also have detailed scenarios that contain multiple disk images. Finally, we have real disk images containing real data from real people; IRB approval is required to work with those disks.

A word about copyright: Some of the disk corpora contain information that is covered by copyright under US law, specifically copies of the Microsoft Windows operating system. US Copyright Law has a four-part test that determines whether or not the distribution of copyrighted material is permissible under “fair use.” To this end, we have developed a program that breaks Microsoft executables in a way that cannot be reversed. We believe that distributing disk images with broken executables for research and educational purposes is permissible under fair use because doing so does not damage the value of the Microsoft copyrighted information that the disk images contain. Please let us know if you feel differently or if you have an alternative strategy for distributing these important research materials.

NPS Test Disk Images

NPS Test Disk Images are a set of disk images that have been created for testing computer forensic tools. These images are free of non-public Personally Identifiable Information (PII) and are approved for release to the general public. The NPS-created data in these images is public domain and free of any copyright restriction; the images may contain some copyrighted data that was made freely available by the copyright holder. These copyrights, where known, are noted in the files themselves. Currently the following images in the NPS corpus have been released:

  • nps-2009-canon2 – A set of images taken with a Canon digital camera that can be used to test basic file recovery, fragmented file recovery, and file carving.
  • nps-2009-casper-rw – An ext3 file system from a bootable USB token that had an installation of Ubuntu 8.10. The operating system was used to browse several US Government websites.
  • nps-2009-hfsjtest1 – A test image of a journaled HFS file system in which the data from a previous version of a file can only be recovered from the HFS journal.
  • nps-2009-ntfs1 – A test image of an NTFS file system including unfragmented and highly fragmented files stored in raw, compressed, and encrypted directories. The decryption key is provided.
  • nps-2009-ubnist1 – The FAT32 file system from which the nps-2009-casper-rw disk image was extracted.
  • nps-2009-domexusers – A disk image of a Windows XP SP3 system that has two users, domexuser1 and domexuser2, who communicate with a third user (domexuser3) via IM and email. Two versions of this disk image will be provided:
    • nps-2009-domexusers – The full system, distributed as an encrypted disk image.
    • nps-2009-domexusers-redacted – The full system with the Microsoft Windows executables redacted so that they cannot be executed.
  • nps-2010-emails – A test disk image containing 30 different email addresses, each one stored in a different document with a different coding scheme.
  • nps-2014-usb-nondeterministic – A series of disk images that were made from a USB storage device that produced different data each time it was read. The original submission ZIP file and narrative are presented, as well as E01 files that were created by extracting the raw files from the ZIP image and re-encoding them.

Digital Corpora Scenarios

You will find additional disk images on the Scenarios page, including:

  • M57-Jean – A single disk scenario involving the exfiltration of corporate documents from an executive’s laptop.
  • Nitroba University Harassment Scenario – A fun-to-solve network forensics scenario.
  • M57-Patents – A complex scenario involving multiple drives and actors set at a small company over the course of several weeks.

The Real Data Corpus

Currently there are over 750 images available for use by bona fide researchers. The images are divided into two categories:

  • Non-US Persons Disk Image Corpus
    Contains images from disks purchased outside the United States.
  • US Persons Disk Image Corpus
    Contains images from disks purchased within the United States.

More information about the Real Data Corpus is available elsewhere on this server.


Please feel free to let us know if you find this corpus useful by leaving a comment below. If you decide to use this corpus in published research, the appropriate citation is: Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada.


Format Conversion

Many of the disk images are distributed in E01 or AFF format. For information on format conversion, please see this page.

See Also

Looking for more disk images? You will find them:
