As an artifact of the way the corpus was collected, many file extensions in the NPS GOVDOCS1 corpus did not reflect the type of the underlying file. For example, many files labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases, the file extension chosen when the document was created no longer matches current usage, as was the case with several files that had a ‘.doc’ extension but were actually WordPerfect files.
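To illustrate how this kind of mismatch can be spotted, here is a minimal Python sketch that guesses a file's type from its leading bytes. It is not the method used to build the rename script; the names `SIGNATURES` and `sniff_type` are hypothetical, and the signature list covers only a handful of well-known formats.

```python
from typing import Optional

# Leading-byte signatures for a few well-known formats (illustrative, not exhaustive).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # container used by legacy .doc/.xls/.ppt
    (b"PK\x03\x04", "zip"),                          # also used by .docx/.xlsx/.pptx
    (b"\x1f\x8b", "gzip"),
]


def sniff_type(head: bytes) -> Optional[str]:
    """Guess a coarse file type from the first bytes of a file, or return None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    # HTML has no fixed magic number; look for a typical opening tag instead.
    text = head.lstrip().lower()
    if text.startswith(b"<!doctype html") or text.startswith(b"<html"):
        return "html"
    return None


# A file named "something.xls" whose content is really an HTML error page:
print(sniff_type(b"<HTML><HEAD><TITLE>404 Not Found</TITLE>"))  # html
# A genuine legacy Office file starts with the OLE2 signature:
print(sniff_type(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8))  # ole2
```

A file whose extension is ‘.xls’ but whose sniffed type is `html` would be a candidate for renaming. Telling a Word document from an Excel spreadsheet inside the OLE2 container, or recognizing WordPerfect files, needs deeper inspection than this sketch attempts.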
We have gone through the corpus and created a shell script that renames the files so that each extension matches current usage. The script contains 115,135 lines. Of these, the following renames are implemented:
Rank  Count  Value(s):
============================
   1  77227  .text -> .txt
   2   9290  .xml -> .html
   3   3683  .pdf -> .html
   4   3565  . -> .html
   5   2602  . -> .unk
   6   2601  .xls -> .dbase3
   7   2082  .text -> .unk
   8   1943  . -> .pdf
   9   1942  .text -> .html
  10   1857  .doc -> .html
  11   1088  .doc -> .rtf
  12    620  .xls -> .html
  13    595  .text -> .f
  14    533  .text -> .xml
  15    459  .ppt -> .html
  16    438  .xls -> .txt
  17    435  .doc -> .txt
  18    346  .doc -> .wp
  19    283  .txt -> .html
  20    269  .eps -> .html
  21    256  .log -> .html
  22    253  .doc -> .unk
  23    228  .swf -> .html
  24    218  .xls -> .unk
  25    179  .text -> .fits
  26    175  .dwf -> .html
  27    166  .gz -> .html
  28    163  .sql -> .html
  29    161  .text -> .tex
  30    155  .html -> .xml
  31    107  .html -> .pdf
  32     96  .text -> .troff
  33     94  .ps -> .html
  34     70  .js -> .html
  35     66  . -> .xml
  36     60  .xls -> .gls
  37     59  .ttf -> .txt
  38     53  .text -> .sgml
  39     45  .jpg -> .html
  40     36  .ppt -> .txt
  41     35  .csv -> .html
         35  .ttf -> .html
  43     30  .ppt -> .unk
  44     29  .text -> .pdf
         29  .xbm -> .txt
  46     26  .java -> .html
         26  .zip -> .html
  48     25  .doc -> .fm
  49     22  .text -> .rtf
  50     21  .pub -> .html
  51     20  .js -> .txt
  52     17  .jar -> .html
         17  .jar -> .txt
         17  .text -> .gz
  55     16  .ps -> .pdf
  56     15  .ppt -> .doc
  57     14  .text -> .swf
         14  .tmp -> .html
         14  .xbm -> .html
  60     13  .doc -> .pdf
         13  .doc -> .troff
  62      9  .pps -> .html
          9  .xlsx -> .html
  64      8  .log -> .txt
  65      7  . -> .rtf
          7  .dll -> .html
          7  .kml -> .html
          7  .xls -> .wk1 (Lotus Notes)
  69      6  .doc -> .f
          6  .kmz -> .html
          6  .xml -> .txt
  72      5  . -> .txt
          5  .doc -> .sgml
          5  .docx -> .html
          5  .eps -> .pdf
          5  .exe -> .html
          5  .html -> .rtf
  78      4  .doc -> .ileaf (Interleaf)
          4  .ppt -> .zip
          4  .pptx -> .html
          4  .text -> .doc
          4  .text -> .kml
          4  .xls -> .zip
  84      3  .bmp -> .html
          3  .jpeg -> .html
          3  .ppt -> .sgml
          3  .text -> .wp
          3  .tif -> .html
          3  .xls -> .doc
          3  .xls -> .xml
  91      2  .exported -> .html
          2  .ppt -> .appledouble (AppleDouble encoded Macintosh file)
          2  .ppt -> .odp
          2  .ppt -> .gd
          2  .tmp -> .xml
          2  .xls -> .123
          2  .xls -> .lnk (MS Windows shortcut)
          2  .xls -> .pdf
  99      1  .csv -> .rtf
          1  .doc -> .par
          1  .doc -> .zip
          1  .doc -> .fits
          1  .doc -> .gz
          1  .doc -> .icns
          1  .doc -> .tex
          1  .doc -> .xls
          1  .doc -> .xml
          1  .docx -> .pdf
          1  .hlp -> .html
          1  .hmtl -> .html
          1  .html -> .gif
          1  .html -> .kml
          1  .kml -> .xml
          1  .pdf -> .xml
          1  .ppt -> .pdf
          1  .sql -> .txt
          1  .sys -> .rtf
          1  .wp -> .pdf
          1  .wp -> .rtf
          1  .xls -> .wk3
          1  .xls -> .bin (mc68020 pure executable)
          1  .xls -> .f
          1  .xls -> .sgml
          1  .xml -> .kml
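A ranked tally like the one above can be derived from the rename script itself. The following Python sketch shows one way to do that. The format of the script's lines is not given here, so the sketch assumes hypothetical lines of the form `mv OLD NEW`, and it assumes that ‘.’ stands for a file with no extension. The function names are illustrative only.

```python
import os
from collections import Counter
from typing import Iterable, List, Tuple


def extension(name: str) -> str:
    """Return the extension including the dot, or '.' when there is none (assumed convention)."""
    ext = os.path.splitext(name)[1]
    return ext if ext else "."


def tally_renames(lines: Iterable[str]) -> Counter:
    """Count (old extension, new extension) pairs in lines assumed to look like 'mv OLD NEW'."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "mv":
            counts[(extension(parts[1]), extension(parts[2]))] += 1
    return counts


def ranked(counts: Counter) -> List[Tuple[int, int, str, str]]:
    """Rows of (rank, count, old, new); tied counts share the rank of the first tied row."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    out = []
    prev_count = None
    rank = 0
    for position, ((old, new), n) in enumerate(rows, start=1):
        if n != prev_count:
            rank, prev_count = position, n
        out.append((rank, n, old, new))
    return out


# Hypothetical script lines, for illustration only.
script = [
    "mv 000/000001.text 000/000001.txt",
    "mv 000/000002.xls 000/000002.html",
    "mv 000/000003.text 000/000003.txt",
    "mv 000/000004.doc 000/000004.wp",
]
for rank, n, old, new in ranked(tally_renames(script)):
    print(f"{rank:4d} {n:6d}  {old} -> {new}")
```

Tied rows share a rank and the next distinct count resumes at its position in the list, which matches the numbering used in the table (for example, rank 41 is followed by rank 43).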
We will rebuild the ZIP files over the next few days, and will replace the existing ZIP files and update the searchable database by 7 December 2010.