The Real Data Corpus (RDC) is a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we have created a data set that closely mimics data as it is found in the real world.
Potential Uses
The Real Data Corpus is a one-of-a-kind scientific resource for:
- Developing and validating forensic and data recovery tools.
- Training students in forensics and data recovery.
- Developing and validating document translation software.
- Exploring and characterizing real-world computing practices, configuration choices, and option settings.
- Studying the storage allocation strategies of file systems under real-world conditions.
Current Contents
As of February 21, 2011, the Non-US Person’s Corpus consists of the following:
- 1,289 hard drive images ranging in size from 500MB to 80GB.
- 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
- 98 CD-ROMs.
Together these total 70TB of data (uncompressed).
Access and Availability
The Real Data Corpus can be distributed to sponsors and collaborators as a set of AFF and E01 files. The AFF files are encrypted with AES-256 using AFF encryption, keyed by either a passphrase or X.509 PKI.
Authorized researchers can download disk images over the Internet from a secure server using SSL. Alternatively, we can package the files onto portable terabyte USB hard drives.
Researchers can be given an account on a multi-user Linux computer on which all of the corpora reside.
In general, use of the RDC is limited to bona fide researchers operating under the oversight of an Institutional Review Board that has a DoD Assurance. For additional information, please read the Real Data Corpus FAQ.
Current Contents
Corpus | Hard Drives | Flash Drives | Optical | GB (Total Uncompressed) |
---|---|---|---|---|
BA | 7 | | | 38 |
CA | 73 | 1 | | 1,064 |
CE | 1 | | | 82 |
CH | 2 | | | 5 |
CN | 143 | 568 | 98 | 3,627 |
DE | 36 | 1 | | 755 |
GR | 13 | | | 27 |
IL | 229 | 4 | | 2,226 |
IN | 487 | 66 | | 26,512 |
MX | 175 | | | 1,110 |
NZ | 1 | | | 4 |
PS | 98 | | | 957 |
TH | 1 | 3 | | 13 |
UA | 23 | | | 565 |
Total | 1,289 | 643 | 98 | 36,990 |
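
The per-country figures can be cross-checked against the Total row with a short script. This is only a minimal sketch: the row values are transcribed from the table, blank cells are assumed to mean zero, and the names `ROWS`, `STATED_TOTAL`, and `column_totals` are illustrative, not part of any RDC tooling.

```python
# Rows transcribed from the table:
# (corpus, hard drives, flash drives, optical, GB uncompressed).
# Assumption: a blank cell in the table means 0.
ROWS = [
    ("BA", 7, 0, 0, 38),
    ("CA", 73, 1, 0, 1064),
    ("CE", 1, 0, 0, 82),
    ("CH", 2, 0, 0, 5),
    ("CN", 143, 568, 98, 3627),
    ("DE", 36, 1, 0, 755),
    ("GR", 13, 0, 0, 27),
    ("IL", 229, 4, 0, 2226),
    ("IN", 487, 66, 0, 26512),
    ("MX", 175, 0, 0, 1110),
    ("NZ", 1, 0, 0, 4),
    ("PS", 98, 0, 0, 957),
    ("TH", 1, 3, 0, 13),
    ("UA", 23, 0, 0, 565),
]

# Totals as listed in the table's final row.
STATED_TOTAL = (1289, 643, 98, 36990)


def column_totals(rows):
    """Sum each numeric column across all corpus rows."""
    return tuple(sum(col) for col in zip(*(row[1:] for row in rows)))


totals = column_totals(ROWS)
labels = ("hard drives", "flash drives", "optical", "GB")
for label, computed, stated in zip(labels, totals, STATED_TOTAL):
    print(f"{label}: computed {computed:,}, stated {stated:,}, "
          f"difference {stated - computed:,}")
```

Run as written, the hard drive, flash drive, and optical counts match the Total row. The per-country GB figures sum to 36,985, which is 5 GB below the listed 36,990; the gap may come from rounding in the per-country figures.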
At the present time the Real Data Corpus is a restricted access data set and is not generally available.