Archive for the ‘Files’ Category

35GB of JPEGs ready for download

March 7th, 2012

We have created a tar file containing 109,282 JPEG files from the govdocs1 corpus. You can download it from:

http://digitalcorpora.org/corp/nps/files/govdocs1/files.jpeg.tar
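
After a download of this size, it may be worth checking the archive against the stated file count before extracting it. The following is a minimal sketch, assuming the tar file is uncompressed and has been saved locally; the function name `count_regular_files` is illustrative and not part of any corpus tooling.

```python
import tarfile


def count_regular_files(tar_path):
    """Count regular-file members in an uncompressed tar without extracting it."""
    n = 0
    # "r|" reads the archive as a forward-only stream, so members are
    # visited one at a time instead of being indexed up front.
    with tarfile.open(tar_path, mode="r|") as tar:
        for member in tar:
            if member.isfile():  # skip directories, links, and so on
                n += 1
    return n
```

For example, `count_regular_files("files.jpeg.tar")` could be compared with the 109,282 files mentioned above.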

 

Categories: Files Tags:

First 512 and 4096 byte block hashes of govdocs1

January 4th, 2011

I have posted a text file containing MD5 hashes for the first 512 bytes and first 4096 bytes of every file in the GOVDOCS1 corpus. This file is intended for research on sector hashing. You can download the file from http://digitalcorpora.org/corp/nps/files/govdocs1/govdocs1-first512-first4096-docid.txt
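
For readers who want to compute comparable hashes for their own files, here is a minimal sketch in Python. The function name `first_block_hashes` is hypothetical, and the handling of files shorter than a block (hashing whatever bytes are present, with no padding) is an assumption; the post does not say how the published file treats short files.

```python
import hashlib


def first_block_hashes(path, sizes=(512, 4096)):
    """Return {size: MD5 hex digest} of the first `size` bytes of a file.

    Assumption: a file shorter than `size` is hashed as-is, without padding.
    """
    with open(path, "rb") as f:
        head = f.read(max(sizes))  # read only as much as the largest block
    return {n: hashlib.md5(head[:n]).hexdigest() for n in sizes}
```

Calling `first_block_hashes("000001.pdf")` on a local file (the file name is only an example) returns a dictionary with one digest for the first 512 bytes and one for the first 4096 bytes.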

Categories: Files Tags:

Announcing GOVDOCS1.1

December 4th, 2010

As an artifact of the way that it was collected, many of the extensions for the files in the NPS GOVDOCS1 corpus did not reflect the type of the underlying file. For example, many files that were labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases, file extensions chosen when the document was created no longer match current usage, as was the case with several files that had a ‘.doc’ extension but were actually WordPerfect files.

We have gone through the corpus and created a shell script that renames the files to match current usage. The script contains 115,135 lines. Of these, the following renames are implemented:

  Rank     Count     Rename (old -> new)
  ============================
      1     77227      .text -> .txt
      2      9290      .xml -> .html
      3      3683      .pdf -> .html
      4      3565      . -> .html
      5      2602      . -> .unk
      6      2601      .xls -> .dbase3
      7      2082      .text -> .unk
      8      1943      . -> .pdf
      9      1942      .text -> .html
     10      1857      .doc -> .html
     11      1088      .doc -> .rtf
     12       620      .xls -> .html
     13       595      .text -> .f
     14       533      .text -> .xml
     15       459      .ppt -> .html
     16       438      .xls -> .txt
     17       435      .doc -> .txt
     18       346      .doc -> .wp
     19       283      .txt -> .html
     20       269      .eps -> .html
     21       256      .log -> .html
     22       253      .doc -> .unk
     23       228      .swf -> .html
     24       218      .xls -> .unk
     25       179      .text -> .fits
     26       175      .dwf -> .html
     27       166      .gz -> .html
     28       163      .sql -> .html
     29       161      .text -> .tex
     30       155      .html -> .xml
     31       107      .html -> .pdf
     32        96      .text -> .troff
     33        94      .ps -> .html
     34        70      .js -> .html
     35        66      . -> .xml
     36        60      .xls -> .gls
     37        59      .ttf -> .txt
     38        53      .text -> .sgml
     39        45      .jpg -> .html
     40        36      .ppt -> .txt
     41        35      .csv -> .html
               35      .ttf -> .html
     43        30      .ppt -> .unk
     44        29      .text -> .pdf
               29      .xbm -> .txt
     46        26      .java -> .html
               26      .zip -> .html
     48        25      .doc -> .fm
     49        22      .text -> .rtf
     50        21      .pub -> .html
     51        20      .js -> .txt
     52        17      .jar -> .html
               17      .jar -> .txt
               17      .text -> .gz
     55        16      .ps -> .pdf
     56        15      .ppt -> .doc
     57        14      .text -> .swf
               14      .tmp -> .html
               14      .xbm -> .html
     60        13      .doc -> .pdf
               13      .doc -> .troff
     62         9      .pps -> .html
                9      .xlsx -> .html
     64         8      .log -> .txt
     65         7      . -> .rtf
                7      .dll -> .html
                7      .kml -> .html
                7      .xls -> .wk1  (Lotus 1-2-3)
     69         6      .doc -> .f
                6      .kmz -> .html
                6      .xml -> .txt
     72         5      . -> .txt
                5      .doc -> .sgml
                5      .docx -> .html
                5      .eps -> .pdf
                5      .exe -> .html
                5      .html -> .rtf
     78         4      .doc -> .ileaf  (Interleaf)
                4      .ppt -> .zip
                4      .pptx -> .html
                4      .text -> .doc
                4      .text -> .kml
                4      .xls -> .zip
     84         3      .bmp -> .html
                3      .jpeg -> .html
                3      .ppt -> .sgml
                3      .text -> .wp
                3      .tif -> .html
                3      .xls -> .doc
                3      .xls -> .xml
     91         2      .exported -> .html
                2      .ppt -> .appledouble (AppleDouble encoded Macintosh file)
                2      .ppt -> .odp
                2      .ppt -> .gd
                2      .tmp -> .xml
                2      .xls -> .123
                2      .xls -> .lnk (MS Windows shortcut)
                2      .xls -> .pdf
     99         1      .csv -> .rtf
                1      .doc -> .par
                1      .doc -> .zip
                1      .doc -> .fits
                1      .doc -> .gz
                1      .doc -> .icns
                1      .doc -> .tex
                1      .doc -> .xls
                1      .doc -> .xml
                1      .docx -> .pdf
                1      .hlp -> .html
                1      .hmtl -> .html
                1      .html -> .gif
                1      .html -> .kml
                1      .kml -> .xml
                1      .pdf -> .xml
                1      .ppt -> .pdf
                1      .sql -> .txt
                1      .sys -> .rtf
                1      .wp -> .pdf
                1      .wp -> .rtf
                1      .xls -> .wk3
                1      .xls -> .bin  (mc68020 pure executable)
                1      .xls -> .f
                1      .xls -> .sgml
                1      .xml -> .kml

You can download the script to perform the fixes from: http://domex.nps.edu/corp/files/govdocs1/fixgovdocs1.zip
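
A tally like the table above could be reproduced from the script with a few lines of Python. This is only a sketch: it assumes each rename is a line of the form `mv OLD NEW`, which has not been checked against the actual script, and `tally_renames` is a hypothetical helper name. Files with no extension are reported as `.`, following the table's convention.

```python
import shlex
from collections import Counter
from pathlib import PurePosixPath


def tally_renames(script_lines):
    """Count (old extension, new extension) pairs in `mv OLD NEW` lines.

    Assumption: renames appear as plain `mv OLD NEW` commands; every other
    line (comments, blank lines, other commands) is ignored.
    """
    counts = Counter()
    for line in script_lines:
        parts = shlex.split(line, comments=True)
        if len(parts) != 3 or parts[0] != "mv":
            continue
        old_ext = PurePosixPath(parts[1]).suffix or "."  # "." means no extension
        new_ext = PurePosixPath(parts[2]).suffix or "."
        counts[(old_ext, new_ext)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real script.
    sample = [
        "mv 000/000001.text 000/000001.txt",
        "mv 000/000002.text 000/000002.txt",
        "mv 000/000003 000/000003.html",
    ]
    for (old, new), n in tally_renames(sample).most_common():
        print(f"{n:10d}      {old} -> {new}")
```

To run it on the real script, the lines of the unpacked file can be passed in, for example `tally_renames(open(script_path))`, where `script_path` is wherever the ZIP was extracted.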

We will be remaking the ZIP files over the next few days and will replace the ZIP files and update the searchable database by 7 December 2010.

Categories: Files Tags:
"This material is based upon work supported by the National Science Foundation under Grant No. 0919593. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."