The Real Data Corpus (RDC) was a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies between 1995 and 2005 have shown that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we created a data set that closely mimics data as it is found in the real world.
Potential Uses
The Real Data Corpus was a one-of-a-kind scientific resource for:
- Developing and validating forensic and data recovery tools.
- Training students in forensics and data recovery.
- Developing and validating document translation software.
- Exploring and characterizing real-world computing practices, configuration choices, and option settings.
- Studying the storage allocation strategies of file systems under real-world conditions.
Current Contents
As of February 21, 2011, the Non-US Person’s Corpus consisted of the following:
- 1,289 hard drive images ranging in size from 500MB to 80GB.
- 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
- 98 CDROMs
For a total of 70TB of data (uncompressed).
Access and Availability
The Real Data Corpus is no longer available.
Historical Contents
Corpus | Hard Drives | Flash Drives | Optical | GB (Total Uncompressed) |
---|---|---|---|---|
BA | 7 | | | 38 |
CA | 73 | 1 | | 1,064 |
CE | 1 | | | 82 |
CH | 2 | | | 5 |
CN | 143 | 568 | 98 | 3,627 |
DE | 36 | 1 | | 755 |
GR | 13 | | | 27 |
IL | 229 | 4 | | 2,226 |
IN | 487 | 66 | | 26,512 |
MX | 175 | | | 1,110 |
NZ | 1 | | | 4 |
PS | 98 | | | 957 |
TH | 1 | 3 | | 13 |
UA | 23 | | | 565 |
Total | 1,289 | 643 | 98 | 36,990 |
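The Total row can be cross-checked against the per-corpus rows with a short script. The following is a minimal Python sketch, not part of the original corpus tooling: the names `ROWS` and `column_totals` are hypothetical, blank cells are assumed to mean zero, and in rows with fewer entries the first value is assumed to be the hard drive count and the last value the size in GB.

```python
# Per-corpus figures transcribed from the Historical Contents table.
# Column order: (hard drives, flash drives, optical, GB uncompressed).
# Assumption: a blank cell in the table is treated as zero.
ROWS = {
    "BA": (7, 0, 0, 38),
    "CA": (73, 1, 0, 1064),
    "CE": (1, 0, 0, 82),
    "CH": (2, 0, 0, 5),
    "CN": (143, 568, 98, 3627),
    "DE": (36, 1, 0, 755),
    "GR": (13, 0, 0, 27),
    "IL": (229, 4, 0, 2226),
    "IN": (487, 66, 0, 26512),
    "MX": (175, 0, 0, 1110),
    "NZ": (1, 0, 0, 4),
    "PS": (98, 0, 0, 957),
    "TH": (1, 3, 0, 13),
    "UA": (23, 0, 0, 565),
}


def column_totals(rows):
    """Sum each column across all corpora and return one total per column."""
    return tuple(sum(column) for column in zip(*rows.values()))


totals = column_totals(ROWS)
print(totals)
```

Under these assumptions the device counts agree with the Total row (1,289 hard drives, 643 flash drives, 98 optical discs), while the GB column sums to 36,985, slightly below the listed 36,990. The per-corpus sizes may have been rounded.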