Govdocs1

September 23rd, 2014

Files

Govdocs1 — (nearly) 1 million freely-redistributable files

In recent years a significant amount of forensic research has involved the analysis of files or file fragments. In the absence of standardized corpora, researchers and students who wish to work with files first need to collect them, a surprisingly difficult task if one wants a large number of files of many types from a variety of sources. Although many files can be freely downloaded from the web, building and running a high-performance document discovery and downloading tool is not a trivial task. Once files are downloaded, they need to be analyzed, characterized, and curated. Finally, many corpora that might be assembled cannot be easily redistributed due to privacy or copyright concerns.

For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by searching for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two. The searches were restricted to documents of specified file types that resided on web servers in the .gov domain, and were run using the Yahoo and Google search engines.
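The search-term selection described above can be sketched in Python. This is only an illustration under stated assumptions, not the tool used to build the corpus: the function name `make_query`, the equal weighting of the three term kinds, and the `site:` and `filetype:` query operators are all assumptions, and the word list is passed in rather than read from a Unix dictionary file.

```python
import random


def make_query(words, rng, filetype):
    """Build one search query string (hypothetical helper).

    The term is a random dictionary word, a random number between
    1 and 1,000,000, or a combination of the two. The choice among
    the three kinds is uniform here, which is an assumption.
    """
    kind = rng.choice(["word", "number", "both"])
    word = rng.choice(words)
    number = rng.randint(1, 1_000_000)  # inclusive on both ends

    if kind == "word":
        term = word
    elif kind == "number":
        term = str(number)
    else:
        term = f"{word} {number}"

    # Restricting by domain and file type with these operators is an
    # assumption about how the search engines were queried.
    return f"{term} site:.gov filetype:{filetype}"


if __name__ == "__main__":
    rng = random.Random(2009)  # seeded so runs are repeatable
    sample_words = ["anchor", "harvest", "ledger", "quartz"]
    for filetype in ["pdf", "doc", "jpg"]:
        print(make_query(sample_words, rng, filetype))
```

Passing the random number generator in as an argument keeps the sketch reproducible, which matters if a list of search terms needs to be regenerated later.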

Each file in the corpus is presented as a numbered file with a file extension (e.g. 0000001.jpg). The file extension is typically the file extension that was provided to us when the file was downloaded. The file extension is a suggestion—it is not part of the corpus.

We are making the corpus available in several ways:

Other metadata:

Note: Due to accidental over-collection involving files from the State of California, approximately 13,722 files have been removed from the original corpus of 1 million files.

Metadata

The following metadata is provided for each of the files:

  • The URL from which the file was downloaded.
  • The date and time of the download.
  • The search term that was used.
  • The search engine that provided the document.
  • The length and SHA1 of the file.
  • A Simple Dublin Core for the file.
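Because the metadata includes the length and SHA1 of each file, a downloaded copy can be checked against it. The following Python sketch shows one way to do that. The function names are hypothetical, and because the storage format of the metadata is not described here, the expected values are simply passed in as arguments.

```python
import hashlib


def length_and_sha1(path, chunk_size=1 << 20):
    """Return (length in bytes, SHA1 hex digest) of the file at path.

    The file is read in chunks so that large files do not have to
    fit in memory.
    """
    digest = hashlib.sha1()
    length = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            length += len(chunk)
    return length, digest.hexdigest()


def matches_metadata(path, expected_length, expected_sha1):
    """Compare a local file with the length and SHA1 from its metadata.

    expected_length and expected_sha1 are assumed to have been read
    from the corpus metadata by the caller.
    """
    length, sha1 = length_and_sha1(path)
    return length == expected_length and sha1 == expected_sha1.lower()
```

Checking the length as well as the hash costs nothing extra, and a length mismatch on its own usually points to a truncated download.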

Malware

Please note that the files in this corpus are verbatim copies of files downloaded from USG web servers. We are aware that some of these files contain malware, in the form of JavaScript exploits and Windows malware that was sent to mailing lists and is now present in the mailing list archives. Although this may trigger some anti-virus programs, the malware will not be removed from the files because it is legitimately part of the corpus.

A malware scan of the govdocs1 directory is now available from http://digitalcorpora.org/corp/files/govdocs1/MetascanClientLog_201306281214.txt .

Analysis

Forensic Innovations, Inc., has kindly made available the following analysis of the corpus using its FITools product:

Citation

Please feel free to let us know if you find this corpus useful by leaving a comment below. If you decide to use this corpus in published research, the appropriate citation is: Garfinkel, Farrell, Roussev and Dinolt, Bringing Science to Digital Forensics with Standardized Forensic Corpora, DFRWS 2009, Montreal, Canada

  1. June 21st, 2010 at 08:05 | #1

    As a software vendor, I think this resource is extremely valuable. It has given us the opportunity to test our tools against a wide range of files collected from the wild. This is much more useful than test data that we have created ourselves. Many software applications create documents with variations in their file structures that can cause major problems when identifying and processing files. Another benefit is the opportunity for multiple software vendors to test their tools against a public collection and provide comparable product comparisons.

    Thank you for this valuable resource!

    Rob Zirnstein
    Forensic Innovations

  2. Mark S
    November 13th, 2012 at 09:05 | #2

    Your subset threads ZIP are no longer available on the FTP. Can you repost them?

  3. April 6th, 2013 at 16:46 | #3

    @Martin Thurn
    Fixed. Thanks.

  4. April 6th, 2013 at 16:47 | #4

    @Mark S
    The subset threads have been restored.

  5. labqt
    May 31st, 2013 at 22:38 | #5

    I kindly request that the subset threads at http://digitalcorpora.org/corp/files/govdocs1/ be restored. As a student researching whitelist files, I would find the files in these folders very useful to study.

  6. Brad Hards
    July 19th, 2013 at 01:09 | #6

    The file and zipfile download links are now 403.

  7. Umit K.
    September 28th, 2013 at 11:56 | #7

    Very useful. Great database…

  8. September 30th, 2013 at 04:18 | #8

    @Brad Hards Fixed. Thanks. We had some issues with the server over the summer.

  9. September 30th, 2013 at 04:18 | #9

    @labqt You are free to download any of the files that you wish!

  10. Lee
    September 30th, 2013 at 13:25 | #10

    Hi, your CGI response refers to an XSD which is 404

  11. Brad Hards
    October 11th, 2013 at 03:39 | #11

    Is it possible to get a sha1sum (or sha256sum) output for the zipfile set?


"This material is based upon work supported by the National Science Foundation under Grant No. 0919593. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."