Govdocs1: (nearly) 1 million freely-redistributable files
For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.
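As a rough illustration of the search-term selection described above, the following Python sketch draws a random dictionary word, a random number between 1 and 1 million, or a combination of the two. It is not the code used to build the corpus. The function name `random_query`, the equal weighting of the three kinds of term, the space-separated combination format, and the small stand-in word list are all assumptions made for this example.

```python
import random


def random_query(words, rng):
    """Return one search term: a word, a number, or a combination of both.

    Assumption: the three kinds of term are equally likely. The source text
    does not say how they were actually weighted or combined.
    """
    kind = rng.choice(["word", "number", "both"])
    word = rng.choice(words)
    number = rng.randint(1, 1_000_000)  # randint bounds are inclusive
    if kind == "word":
        return word
    if kind == "number":
        return str(number)
    return f"{word} {number}"


# Hypothetical stand-in for the Unix dictionary, which on many systems
# is a text file such as /usr/share/dict/words with one word per line.
sample_words = ["apple", "harbor", "ledger"]

rng = random.Random(0)  # seeded so that a run can be repeated
queries = [random_query(sample_words, rng) for _ in range(5)]
```

Each term would then be submitted to a search engine with restrictions on file type and on the .gov domain. That step is omitted here because the exact query syntax is not given above.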
Each file in the corpus is presented as a numbered file with a file extension (e.g. 0000001.jpg). The extension is typically the one that was provided to us when the file was downloaded. The file extension is a suggestion; it is not part of the corpus.
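Because the extension is only a suggestion, a tool that consumes the corpus may want to check the first bytes of each file rather than trust the name. The sketch below is one minimal way to do that in Python, using a handful of well-known file signatures. The names `SIGNATURES`, `sniff_type`, and `sniff_file` are illustrative, and the table is far from complete.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # also matches ZIP-based formats such as .docx
]


def sniff_type(header: bytes):
    """Return a type name if the header starts with a known signature, else None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read the first 16 bytes of a file and classify them with sniff_type."""
    with open(path, "rb") as f:
        return sniff_type(f.read(16))
```

A file whose sniffed type disagrees with its extension is not necessarily damaged; it may simply have been served under a misleading name.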
We are making the corpus available in several ways:
- A set of 1000 directories, with 1000 files in each directory, downloadable from our server at http://downloads.digitalcorpora.org/corpora/files/govdocs1/.
- A set of 1000 ZIP files, each with 1000 files, downloadable from our server at http://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles/.
- A tar file with 109,282 JPEG pictures from the govdocs1m corpus at http://downloads.digitalcorpora.org/corpora/files/govdocs1/files.jpeg.tar.
- A set of 10 subset “threads” (subset0.zip through subset9.zip), each one containing 1000 randomly chosen documents. These subsets were specifically created to facilitate pilot studies and student research projects, with the rationale that it’s easier to work with 1000 files than with 1 million. Students are encouraged to use one subset for development and another subset for testing.
- A contextual feature list of data from the digitalcorpora can be found at http://downloads.digitalcorpora.org/corpora/files/2012-feature-list/
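For a first look at one of the downloaded ZIP files or subsets, it can help to count how many files carry each extension. The following Python sketch does this with the standard `zipfile` module. The function name `tally_extensions` and the member names in the small in-memory demo archive are hypothetical; the internal layout of the real archives is not described above.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def tally_extensions(zip_source):
    """Count archive members by lowercase file extension.

    zip_source may be a path or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            counts[PurePosixPath(info.filename).suffix.lower()] += 1
    return counts


# Demo on a tiny in-memory archive with made-up member names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo/0000001.jpg", b"")
    zf.writestr("demo/0000002.pdf", b"")
    zf.writestr("demo/0000003.JPG", b"")
demo_counts = tally_extensions(buf)
```

For a real download, the same function can be called with the path of the local file, for example `tally_extensions("subset0.zip")`.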
The following metadata is provided for each of the files:
- The URL from which the file was downloaded.
- The date and time of the download.
- The search term that was used.
- The search engine that provided the document.
- The length and SHA1 of the file.
- A Simple Dublin Core for the file.
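The length and SHA1 make it possible to confirm that a local copy of a file matches the one recorded in the metadata. A minimal Python sketch of such a check follows. The function names and the idea of passing the expected values as arguments are assumptions for illustration, since the format in which the metadata is delivered is not specified here.

```python
import hashlib


def length_and_sha1(path, chunk_size=1 << 20):
    """Return (length in bytes, SHA1 hex digest) of a file, read in chunks."""
    digest = hashlib.sha1()
    length = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            length += len(chunk)
    return length, digest.hexdigest()


def matches_metadata(path, expected_length, expected_sha1):
    """Compare a local file against the length and SHA1 from the metadata."""
    length, sha1_hex = length_and_sha1(path)
    return length == expected_length and sha1_hex == expected_sha1.lower()
```

Reading in chunks keeps memory use flat, which matters when the check is run across the whole corpus.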
Unfortunately, the metadata server is currently down.
A malware scan of the govdocs1 directory is now available from http://downloads.digitalcorpora.org/corpora/files/govdocs1/MetascanClientLog_201306281214.txt.
Forensic Innovations, Inc., has kindly made available the following analysis of the corpus using its FITools product:
Please feel free to let us know if you find this corpus useful by leaving a comment below. If you decide to use this corpus in published research, the appropriate citation is: Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada.