Real Data Corpus
The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.
The Real Data Corpus is a one-of-a-kind scientific resource for:
- Developing and validating forensic and data recovery tools.
- Training students in forensics and data recovery.
- Developing and validating document translation software.
- Exploring and characterizing real-world computing practices, configuration choices, and option settings.
- Studying the storage allocation strategies of file systems under real-world conditions.
As of February 21, 2011, the Non-US Person’s Corpus consists of the following:
- 1,289 hard drive images ranging in size from 500MB to 80GB.
- 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
- 98 CD-ROMs.
Together, these images total 36TB of uncompressed data.
Access and Availability
The Real Data Corpus can be distributed to sponsors and collaborators as a set of AFF and E01 files. The AFF files are encrypted with AES-256 using AFF encryption, which can be based on either a passphrase or X.509 PKI.
Authorized researchers can download disk images over the Internet from a secure server using SSL. Alternatively, we can package the files onto portable terabyte USB hard drives.
Researchers can also be given an account on a multi-user Linux computer on which all of the corpora reside.
Finally, we have developed a remote access framework: we publish XML files of each drive’s metadata; you select which sectors you need and download them over the Internet using our XMLRPC framework.
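The select-then-download workflow of the remote access framework might look roughly like the sketch below. It is illustrative only: the XML layout (`drive`, `file`, and `run` elements), the 512-byte sector size, the server URL argument, and the remote method name `get_sector` are all assumptions made for this example and are not taken from the corpus's actual metadata schema or XMLRPC interface.

```python
import xml.etree.ElementTree as ET
import xmlrpc.client

SECTOR_SIZE = 512  # assumed sector size in bytes

# Hypothetical metadata for one drive; the real published schema may differ.
SAMPLE_METADATA = """\
<drive id="example-0001">
  <file name="report.doc">
    <run offset="1024" length="1536"/>
    <run offset="8192" length="100"/>
  </file>
</drive>
"""


def sectors_for_file(xml_text, filename, sector_size=SECTOR_SIZE):
    """Return the sorted sector numbers that cover the named file's byte runs."""
    root = ET.fromstring(xml_text)
    sectors = set()
    for file_elem in root.iter("file"):
        if file_elem.get("name") != filename:
            continue
        for run in file_elem.iter("run"):
            start = int(run.get("offset"))
            length = int(run.get("length"))
            if length <= 0:
                continue  # an empty run covers no sectors
            first = start // sector_size
            last = (start + length - 1) // sector_size
            sectors.update(range(first, last + 1))
    return sorted(sectors)


def fetch_sectors(server_url, drive_id, sectors):
    """Download the selected sectors one at a time over XML-RPC.

    `get_sector` is a hypothetical remote method name; substitute whatever
    the real server exposes.
    """
    proxy = xmlrpc.client.ServerProxy(server_url, use_builtin_types=True)
    return {n: proxy.get_sector(drive_id, n) for n in sectors}


if __name__ == "__main__":
    # Only the local selection step runs here; no network access is needed.
    print(sectors_for_file(SAMPLE_METADATA, "report.doc"))
```

Separating sector selection from the download step means the metadata can be examined offline, and only the sectors actually needed are requested from the server.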
|Corpus||Hard Drives||Flash Drives||Optical||GB (Total Uncompressed)|