Real Data Corpus

The Real Data Corpus (RDC) was a collection of raw data extracted from data-carrying devices that were purchased on the secondary market around the world. Many studies between 1995 and 2005 showed that hard drives, cell phones, USB memory sticks, and other data-carrying devices are frequently discarded by their original users without the data first being cleared or purged. By purchasing these devices and extracting their data, we created a data set that closely mimics data as it is found in the real world.

Potential Uses

The Real Data Corpus was a one-of-a-kind scientific resource for:

  • Developing and validating forensic and data recovery tools.
  • Training students in forensics and data recovery.
  • Developing and validating document translation software.
  • Exploring and characterizing real-world computing practices, configuration choices, and option settings.
  • Studying the storage allocation strategies of file systems under real-world conditions.

Current Contents

As of February 21, 2011, the Non-US Person’s Corpus consisted of the following:

  • 1,289 hard drive images ranging in size from 500MB to 80GB.
  • 643 flash memory images (USB, Sony Memory Stick, SD and other), ranging from 128MB to 4GB.
  • 98 CD-ROMs.

The total was 70TB of data (uncompressed).

Access and Availability

The Real Data Corpus is no longer available.

Historical Contents

Corpus  Hard Drives  Flash Drives  Optical  GB (Total Uncompressed)
BA                7             -        -                       38
CA               73             1        -                    1,064
CE                1             -        -                       82
CH                2             -        -                        5
CN              143           568       98                    3,627
DE               36             1        -                      755
GR               13             -        -                       27
IL              229             4        -                    2,226
IN              487            66        -                   26,512
MX              175             -        -                    1,110
NZ                1             -        -                        4
PS               98             -        -                      957
TH                1             3        -                       13
UA               23             -        -                      565
Total         1,289           643       98                   36,990
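
The totals row can be checked by summing each column. The Python sketch below does this. It uses the row values as transcribed from the table above and treats every blank cell as zero, which is an assumption rather than something the source states. The names ROWS and column_totals are illustrative only.

```python
# Rows transcribed from the Historical Contents table:
# (corpus code, hard drives, flash drives, optical discs, total uncompressed GB)
# Assumption: a blank cell in the table means zero devices of that type.
ROWS = [
    ("BA", 7, 0, 0, 38),
    ("CA", 73, 1, 0, 1064),
    ("CE", 1, 0, 0, 82),
    ("CH", 2, 0, 0, 5),
    ("CN", 143, 568, 98, 3627),
    ("DE", 36, 1, 0, 755),
    ("GR", 13, 0, 0, 27),
    ("IL", 229, 4, 0, 2226),
    ("IN", 487, 66, 0, 26512),
    ("MX", 175, 0, 0, 1110),
    ("NZ", 1, 0, 0, 4),
    ("PS", 98, 0, 0, 957),
    ("TH", 1, 3, 0, 13),
    ("UA", 23, 0, 0, 565),
]


def column_totals(rows):
    """Return (hard drives, flash drives, optical discs, GB) summed over all rows."""
    hard = sum(row[1] for row in rows)
    flash = sum(row[2] for row in rows)
    optical = sum(row[3] for row in rows)
    gigabytes = sum(row[4] for row in rows)
    return hard, flash, optical, gigabytes


if __name__ == "__main__":
    print(column_totals(ROWS))
```

Run as a script, this prints (1289, 643, 98, 36985). The three device counts match the published totals row. The per-corpus GB figures sum to 36,985, five less than the published 36,990; one possible explanation is rounding of the per-corpus sizes, though the source does not say.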
