The Real Data Corpus
The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.
The Real Data Corpus is a one-of-a-kind scientific resource for:
- Developing and validating forensic and data recovery algorithms and tools.
- Developing and validating document translation software.
- Exploring and characterizing real-world computing practices, configuration choices, and option settings.
- Studying the storage allocation strategies of file systems under real-world conditions.
The corpus includes:
- A total of 156 hard drive images ranging in size from 500MB to 80GB.
- Approximately 600 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
- Approximately 100 CDs, all purchased outside the US.
- Approximately 10 digital camera memory images.
- Approximately 40 GSM SIM chip memory images.
More details of the corpus content can be found in Garfinkel, Farrell, Roussev and Dinolt, “Bringing Science to Digital Forensics with Standardized Forensic Corpora,” DFRWS 2009, Montreal, Canada.
IRB Approval Required for “Research”
The National Research Act (NRA) of 1974 and the Common Rule govern all federally funded research in the United States that is performed with human beings as experimental subjects. Because portions of the Real Data Corpus were funded by the US Government, this legal framework must be followed in research involving the Real Data Corpus.
The Common Rule creates a four-part test that determines whether a proposed activity must be reviewed by an Institutional Review Board (IRB). Specifically, IRB approval is required if:
- The activity constitutes scientific “research,” a term that the Common Rule broadly defines as “a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge.”
- The research is federally funded.
- The research involves human subjects, which the Common Rule defines as “a living individual about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information.”
- The research is not “exempt” under the regulations. The Common Rule exempts research involving “existing data, documents, [and] records” provided that the data set is either “publicly available” or that the subjects “cannot be identified, directly or through identifiers linked to the subjects” (§46.101(b)(4)).
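The four-part test above can be read as a simple conjunction of conditions. The following is a minimal sketch of that logic, assuming a hypothetical `Activity` record whose field names are invented here for illustration. It models only the single exemption quoted above (§46.101(b)(4)) and is a simplification of the regulation, not a restatement of it.

```python
from dataclasses import dataclass

@dataclass
class Activity:
    # All field names are illustrative assumptions, not regulatory terms.
    is_research: bool
    federally_funded: bool
    involves_human_subjects: bool
    uses_only_existing_data: bool
    data_publicly_available: bool
    subjects_identifiable: bool

def is_exempt(activity: Activity) -> bool:
    """Model only the existing-data exemption quoted in the text."""
    return activity.uses_only_existing_data and (
        activity.data_publicly_available or not activity.subjects_identifiable
    )

def irb_review_required(activity: Activity) -> bool:
    """All four parts of the test must hold for review to be required."""
    return (
        activity.is_research
        and activity.federally_funded
        and activity.involves_human_subjects
        and not is_exempt(activity)
    )

# Research on the RDC as characterized in this document: existing data that
# is not publicly available and in which individuals can be identified.
rdc_study = Activity(
    is_research=True,
    federally_funded=True,
    involves_human_subjects=True,
    uses_only_existing_data=True,
    data_publicly_available=False,
    subjects_identifiable=True,
)
review_needed = irb_review_required(rdc_study)
```

Under these assumptions the exemption fails on both of its alternatives, so `review_needed` is true.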
Research involving the Real Data Corpus is not exempt under the Common Rule because the RDC is not publicly available and in many cases it is possible to identify individuals whose data are in the collection. Furthermore, the majority of the subjects included in the Real Data Corpus have not provided consent to have their data used for research.
Several mitigating factors allow the use of this data: the data was lawfully obtained; research involving this data is “minimal risk” (provided that the data is properly protected and personally identifiable information inside the RDC is kept confidential); there is substantial public benefit in using the RDC for research into computer forensics and computer security; and there is no practical alternative to using this data.
Even if research involving the RDC were exempt, most US universities do not allow experimenters to make their own determination of exemption. Instead, these institutions require that the experimenter submit an application for exempt research to the IRB.
To date, no IRB has blocked the approval of research that involves the RDC.
In order to submit an application to an IRB, all experimenters who will make use of the human subject data must complete the human subjects training prescribed by their institution. Most institutions prohibit students from filing applications directly and instead require that an application be filed by a researcher or professor who can be considered a “principal investigator” for external funding.
As a result, any proposed use of the RDC in research requires that an IRB application be filed with the host institution and with the Naval Postgraduate School (NPS). Copies of the application and of the approval from both the host institution and NPS must be provided before access is granted. The application must clearly state:
- The proposed research that is to be done.
- Why it is necessary to use the RDC; why simulated or realistic data cannot be used as an alternative.
- What measures will be used to protect the data in the RDC.
- What measures will be used to prevent the publication of personally identifiable information in any research products.
Please provide us with your IRB application before submitting it to your IRB! We can review the application and let you know whether it is consistent with the IRB approval that we have already obtained, or whether we will need to apply for additional IRB approval.
Sample applications are available upon request.
Alternatives to IRB Approval
If you are interested in working with realistic disk images and do not wish to obtain IRB approval, you may be interested in working with the NPS Realistic Corpora. These are actual disk images of working systems, but the data on the disks was created by investigators according to scripts: the images do not contain identifiable information from actual persons. These images can be downloaded from http://digitalcorpora.org/ without prior approval.
Access and Availability
The Real Data Corpus can be distributed to sponsors and collaborators as a set of encrypted AFF files. The files are protected with AES-256 using AFF encryption, keyed by either a passphrase or X.509 PKI. Several distribution methods are available:
- Disk images can be downloaded over the Internet from a secure server using SSL by authorized researchers.
- Alternatively, we can package the files onto portable terabyte USB hard drives or optical tape.
- Researchers can be given an account on a multi-user Linux computer on which all of the corpora reside.
- Finally, we have developed a remote exploitation framework: we publish XML files of each drive's metadata; you select which sectors you need and download them over the Internet using our XMLRPC framework.
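Selecting sectors, as in the last option, comes down to mapping byte ranges onto sector numbers. The sketch below illustrates that arithmetic, assuming 512-byte sectors and assuming each file's location is known as a list of (byte offset, length) pairs. Both function names are hypothetical and are not part of the framework described here.

```python
def sectors_for_byte_run(offset, length, sector_size=512):
    """Return the range of sector numbers that cover `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return range(first, last + 1)

def sectors_needed(byte_runs, sector_size=512):
    """Return the sorted, de-duplicated sectors covering all (offset, length) runs."""
    needed = set()
    for offset, length in byte_runs:
        needed.update(sectors_for_byte_run(offset, length, sector_size))
    return sorted(needed)

# A fragmented file: one full sector at the start of the image plus
# 100 bytes that straddle the boundary between sectors 1 and 2.
wanted = sectors_needed([(0, 512), (1000, 100)])
```

Collecting sectors into a set before sorting means that runs sharing a sector do not cause it to be requested twice.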
Individual files can be accessed over the Internet from our secure server using a remote exploitation framework based on XMLRPC that we have devised:
- Collaborators and sponsors are given a username and password that allow access to our Subversion source code repository, our research wiki, and the web-based catalog of disk images.
- Each disk image has been processed using fiwalk, a file system walking program that uses the Sleuth Kit API. The fiwalk program creates an XML data structure for each disk that includes the partitions, the resident files, deleted files, and orphan files. Each file is listed by its file name (if available), its MAC times, MD5, SHA1, extractable metadata, and a unique ID. These XML files can be downloaded using HTTP.
- A web service using XMLRPC takes the unique IDs and returns the bytes associated with the matching file.
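The workflow above (read the per-disk XML, pick a unique ID, request the bytes, check the hash) can be sketched as follows. This is an illustration under stated assumptions rather than the actual client. The sample XML, its element names (`fileobject`, `id`, `filename`, `hashdigest`), and the remote method name `get_file` are all hypothetical placeholders, and the real catalog format and service API may differ.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical sample loosely modeled on fiwalk-style output; the element
# names and layout used by the real catalog are assumptions.
SAMPLE_XML = """\
<fiwalk>
  <volume>
    <fileobject>
      <id>1001</id>
      <filename>Documents/report.doc</filename>
      <hashdigest type="md5">d41d8cd98f00b204e9800998ecf8427e</hashdigest>
      <hashdigest type="sha1">da39a3ee5e6b4b0d3255bfef95601890afd80709</hashdigest>
    </fileobject>
    <fileobject>
      <id>1002</id>
      <hashdigest type="md5">0cc175b9c0f1b6a831c399e269772661</hashdigest>
    </fileobject>
  </volume>
</fiwalk>
"""

def list_files(xml_text):
    """Return one dict per file entry: unique ID, name (or None), hashes by type."""
    root = ET.fromstring(xml_text)
    files = []
    for obj in root.iter("fileobject"):
        hashes = {h.get("type"): h.text for h in obj.findall("hashdigest")}
        files.append({
            "id": obj.findtext("id"),
            "name": obj.findtext("filename"),  # None when no file name is recorded
            "hashes": hashes,
        })
    return files

def fetch_file(proxy, file_id):
    """Request the bytes for one unique ID from an XML-RPC proxy object.

    `get_file` is a placeholder method name, not the service's real API.
    """
    result = proxy.get_file(file_id)
    # xmlrpc.client may wrap binary payloads in a Binary object with a .data field.
    return getattr(result, "data", result)

def matches_md5(data, expected_hex):
    """Check downloaded bytes against the MD5 recorded in the XML."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()
```

In practice `proxy` would be an `xmlrpc.client.ServerProxy` pointed at the service URL supplied with your account credentials. Comparing the downloaded bytes against the recorded MD5 or SHA1 is a cheap way to confirm that the returned file matches its catalog entry.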
Ownership and Legal Status
The Non-US Real Data Corpus consists of images from hard drives, flash memory, and small devices purchased with funds from NSF Award 0730389 and with other governmental funds. These images are available to qualified researchers who agree to the terms of the Institutional Review Board application under which the data was collected.
Because the RDC was purchased on the secondary market, use in the United States is governed by the “First Sale” doctrine and by the US Supreme Court's ruling in California v. Greenwood (486 US 35). Essentially, when the data-carrying devices were sold and/or discarded, all privacy rights to the data on those devices were forfeited.
Simson Garfinkel asserts a compilation copyright for the two Garfinkel corpora.
Because the media that make up the Real Data Corpus were lawfully purchased on the secondary market, the original data custodians have legally forfeited any privacy rights that they might previously have had in the data.
From a moral perspective, however, the information in this corpus must be treated with respect and processed using strong computer security measures. That is because the Real Data Corpus literally contains “real data from real people.” Many of the data subjects did not knowingly release the information in the corpus: the data subjects may have tried but failed to erase the contents of the media before it was sold on the secondary market. Alternatively, the data may have been released not by the subject, but by a data custodian such as a business or consultant. For these reasons it is our practice to treat this information as privacy-sensitive data, even though legally it is not.
 PL 93-348, see https://www.govinfo.gov/content/pkg/STATUTE-88/pdf/STATUTE-88-Pg342.pdf
 45 CFR 46, see http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm
 §46.102 (d)
 §46.103 (a)
 §46.102 (f)
 §46.101 (b)