Real Data Corpus

April 29th, 2017

The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.

Potential Uses

The Real Data Corpus is a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery tools.
  • Training students in forensics and data recovery.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

Current Contents

As of February 21, 2011, the Non-US Person’s Corpus consists of the following:

  • 1,289 hard drive images ranging in size from 500MB to 80GB.
  • 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
  • 98 CD-ROMs.

This amounts to a total of 70TB of data (uncompressed).

Access and Availability

The Real Data Corpus can be distributed to sponsors and collaborators as a set of AFF and E01 files. The AFF files are encrypted with AES-256 using AFF encryption, and the encryption can be based on either a passphrase or X.509 PKI.

Authorized researchers can download disk images over the Internet from a secure server using SSL. Alternatively, we can package the files onto portable terabyte USB hard drives.

Researchers can also be given an account on a multi-user Linux computer on which all of the corpora reside.

In general, use of the RDC is limited to bona fide researchers operating under the oversight of an Institutional Review Board that has a DoD Assurance. For additional information, please read the Real Data Corpus FAQ.

Current Contents

Corpus   Hard Drives   Flash Drives   Optical   GB (Total Uncompressed)
BA                 7              -         -                        38
CA                73              1         -                     1,064
CE                 1              -         -                        82
CH                 2              -         -                         5
CN               143            568        98                     3,627
DE                36              1         -                       755
GR                13              -         -                        27
IL               229              4         -                     2,226
IN               487             66         -                    26,512
MX               175              -         -                     1,110
NZ                 1              -         -                         4
PS                98              -         -                       957
TH                 1              3         -                        13
UA                23              -         -                       565
Total          1,289            643        98                    36,990
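
The Total row can be cross-checked against the per-corpus rows with a few lines of Python. This is only a sketch: the rows are transcribed by hand from the table above, and the assignment of numbers to columns in sparse rows is an assumption (a row with three numbers is read as hard drives, flash drives, GB; a row with two numbers as hard drives, GB). The names `ROWS`, `STATED_TOTAL`, and `column_sums` are illustrative and not part of any RDC tooling.

```python
# Rows transcribed from the table above:
# (corpus, hard drives, flash drives, optical, GB uncompressed).
# Empty cells are entered as 0; the column assignment for sparse
# rows is an assumption, as noted in the lead-in.
ROWS = [
    ("BA", 7, 0, 0, 38),
    ("CA", 73, 1, 0, 1064),
    ("CE", 1, 0, 0, 82),
    ("CH", 2, 0, 0, 5),
    ("CN", 143, 568, 98, 3627),
    ("DE", 36, 1, 0, 755),
    ("GR", 13, 0, 0, 27),
    ("IL", 229, 4, 0, 2226),
    ("IN", 487, 66, 0, 26512),
    ("MX", 175, 0, 0, 1110),
    ("NZ", 1, 0, 0, 4),
    ("PS", 98, 0, 0, 957),
    ("TH", 1, 3, 0, 13),
    ("UA", 23, 0, 0, 565),
]

# The Total row as printed in the table.
STATED_TOTAL = (1289, 643, 98, 36990)

COLUMN_NAMES = ("Hard Drives", "Flash Drives", "Optical", "GB")


def column_sums(rows):
    """Sum each numeric column (indices 1..4) over all corpus rows."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


if __name__ == "__main__":
    computed = column_sums(ROWS)
    for name, got, stated in zip(COLUMN_NAMES, computed, STATED_TOTAL):
        print(f"{name:<12} computed={got:>7,} stated={stated:>7,} diff={stated - got}")
```

Under this reading, the three device-count columns add up exactly to the stated totals (1,289, 643, and 98). The GB column adds up to 36,985, which is 5 GB below the stated 36,990; that gap would be consistent with the per-corpus sizes having been rounded to whole gigabytes, though the source does not say so.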

To obtain access to the Real Data Corpus, please contact Michael McCarrin at the Naval Postgraduate School.

  1. Steven Wood
    October 3rd, 2017 at 02:14 | #1

    Hello,

    I am a Doctor of Science in Cybersecurity student at Capitol Technology University and am interested in the Corpora for testing purposes.

    How can I go about gaining access?

    Thank you.

    Steven

"This material is based upon work supported by the National Science Foundation under Grant No. 0919593. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."