Govdocs1
Govdocs1 — (nearly) 1 million freely-redistributable files
For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by using the Yahoo and Google search engines to search for documents of specified file types that resided on web servers in the .gov domain. The search terms were words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two.
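The search-term procedure described above can be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name `make_query`, the three-way random choice, the "word, space, number" form of a combination, and the stand-in word list are all assumptions. The restriction to the .gov domain and to specific file types is left out, since the exact query syntax is not given here.

```python
import random

def make_query(words, rng):
    """Return one random search term: a word, a number, or both combined.

    Hypothetical helper; how the original combined words and numbers
    is an assumption made for this sketch.
    """
    kind = rng.choice(["word", "number", "both"])
    word = rng.choice(words)
    number = rng.randint(1, 1_000_000)  # randint is inclusive on both ends
    if kind == "word":
        return word
    if kind == "number":
        return str(number)
    return f"{word} {number}"

# Stand-in word list. The original drew words from the Unix dictionary;
# on many systems that is /usr/share/dict/words (an assumed path).
words = ["apple", "river", "stone"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
queries = [make_query(words, rng) for _ in range(5)]
```

Passing the random generator in explicitly keeps the sketch deterministic for a given seed, which makes a collection run easier to reproduce.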
Each file in the corpus is presented as a numbered file with a file extension (e.g. 0000001.jpg). The file extension is typically the file extension that was provided to us when the file was downloaded. The file extension is a suggestion—it is not part of the corpus.
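Because the extension is only a suggestion, a tool may prefer to look at the file's leading bytes instead. The following is a minimal sketch under that assumption; the `MAGIC` table and the `sniff_type` function are hypothetical names, and the table covers only a handful of well-known signatures, far fewer than the formats found in the corpus.

```python
# A few well-known file signatures ("magic numbers"), checked in order.
MAGIC = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # also the container for docx, xlsx and pptx
]

def sniff_type(data: bytes) -> str:
    """Guess a file type from its first bytes, ignoring the file name."""
    for prefix, name in MAGIC:
        if data.startswith(prefix):
            return name
    return "unknown"

# Usage: read only the first few bytes of a file and classify it.
# with open(path, "rb") as f:      # `path` is whatever file you are checking
#     guessed = sniff_type(f.read(16))
```

A prefix check like this is only a first pass: it cannot tell a plain ZIP archive from an Office document, and it says nothing about whether the rest of the file is well formed.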
We are making the corpus available in several ways:
- A set of 1000 directories, with 1000 files in each directory, downloadable from our server at http://digitalcorpora.org/corp/nps/files/govdocs1/.
- A set of 1000 ZIP files, each with 1000 files, downloadable from our server at http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/.
- A tar file with 109,282 JPEG pictures from the govdocs1m corpus at http://digitalcorpora.org/corp/nps/files/govdocs1/files.jpeg.tar.
- A set of 10 subset “threads” (subset0.zip through subset9.zip), each containing 1000 randomly chosen documents. These subsets were created specifically to facilitate pilot studies and student research projects, on the rationale that it is easier to work with 1000 files than with 1 million. Students are encouraged to use one subset for development and another subset for testing.
- A search interface at http://digitalcorpora.org/corpora/files/search-govdocs1 that allows searching for any file by search term or URL fragment.
Note: Due to accidental over-collection involving files from the State of California, approximately 13,722 files have been removed from the original corpus of 1 million files.
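The advice to develop on one subset and test on another can be made repeatable with a few lines of code. This is a sketch only: the subset file names come from the list above, while the function name `pick_dev_and_test` and the seed value are illustrative assumptions.

```python
import random

# subset0.zip through subset9.zip, as named in the list above.
SUBSETS = [f"subset{i}.zip" for i in range(10)]

def pick_dev_and_test(seed):
    """Pick two different subsets: one for development, one for testing."""
    rng = random.Random(seed)
    # sample() draws without replacement, so the two names always differ.
    dev, test = rng.sample(SUBSETS, 2)
    return dev, test

dev, test = pick_dev_and_test(seed=2009)
```

Recording the seed alongside the results lets someone else reconstruct which subset was held out for testing.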
Metadata
The following metadata is provided for each of the files:
- The URL from which the file was downloaded.
- The date and time of the download.
- The search term that was used.
- The search engine that provided the document.
- The length and SHA1 of the file.
- A Simple Dublin Core for the file.
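The recorded length and SHA1 make it possible to check that a local copy of a file is intact. Below is a minimal sketch of such a check; the function name `matches_metadata` and its parameters are assumptions, and the SHA1 is assumed to be given as a hexadecimal string.

```python
import hashlib

def matches_metadata(data: bytes, expected_length: int, expected_sha1: str) -> bool:
    """Return True if the bytes match the recorded length and SHA1 digest."""
    # Cheap check first: a length mismatch makes hashing unnecessary.
    if len(data) != expected_length:
        return False
    # Compare case-insensitively, since hex digests may be upper or lower case.
    return hashlib.sha1(data).hexdigest() == expected_sha1.strip().lower()

# Usage with a local copy of a corpus file:
# with open(path, "rb") as f:      # `path` is the file being verified
#     ok = matches_metadata(f.read(), recorded_length, recorded_sha1)
```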
Metadata is available through a simple XMLRPC service. To look up document 333333, use this URL: http://digitalcorpora.org/corp/nps/files/govdocs1/info.cgi?docid=333333
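For scripts that look up many documents, the lookup URL can be built from the document number. The sketch below only constructs the URL, following the example for document 333333; the helper name `info_url` is an assumption, and no request is made because the response format is not described here.

```python
from urllib.parse import urlencode

# Base address taken from the example lookup URL above.
INFO_CGI = "http://digitalcorpora.org/corp/nps/files/govdocs1/info.cgi"

def info_url(docid: int) -> str:
    """Build the metadata lookup URL for one document number."""
    return f"{INFO_CGI}?{urlencode({'docid': docid})}"

url = info_url(333333)
```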
Malware
Please note that the files in this corpus are verbatim copies of files downloaded from USG webservers. We are aware that some of these files contain malware in the form of JavaScript exploits and Windows malware that was sent to mailing lists (that are now present in the mailing list archives). Although this may trigger some anti-virus programs, the malware will not be removed from the files because it is legitimately part of the corpus.
Analysis
Forensic Innovations, Inc., has kindly made available the following analysis of the corpus using its FITools product:
- Simple statistical report of the digital corpora.
- groundtruth-fitools.zip, the FITools analysis of every file in the corpus.
Citation
Please feel free to let us know if you find this corpus useful by leaving a comment below. If you decide to use this corpus in published research, the appropriate citation is: Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada
As a software vendor, I think this resource is extremely valuable. It has given us the opportunity to test our tools against a wide range of files collected from the wild. This is much more useful than test data that we have created ourselves. Many software applications create documents with variations in their file structures that can cause major problems when identifying and processing files. Another benefit is the opportunity for multiple software vendors to test their tools against a public collection and provide comparable product comparisons.
Thank you for this valuable resource!
Rob Zirnstein
Forensic Innovations
Your subset thread ZIP files are no longer available on the FTP. Can you repost them?
When I click on the doc info link (sample http://digitalcorpora.org/corp/nps/files/govdocs1/info.cgi?docid=333333), it just displays the source code of the CGI program!?!
@Martin Thurn
Fixed. Thanks.
@Mark S
The subset threads have been restored.
I kindly request that the subset threads at http://digitalcorpora.org/corpora/files/corp/files/govdocs1/ be restored. As a student, I am researching whitelist files, and the files in these folders would be good enough for my study.