Because of the way the NPS GOVDOCS1 corpus was collected, many of its file extensions did not reflect the type of the underlying file. For example, many files labeled ‘.xls’ did not contain Microsoft Excel spreadsheets, but instead contained HTML error messages from US government web servers indicating that the file was no longer available. In other cases, the file extension chosen when the document was created no longer matches current usage, as with several files that had a ‘.doc’ extension but were actually WordPerfect files.
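To make the mismatch concrete, here is a minimal sketch of how a file's leading bytes can be compared against its extension. It does not describe how the corpus was actually checked. The function name `sniff_extension` and the short signature table are assumptions chosen for illustration, and a real classifier would need many more signatures.

```python
# Hypothetical sketch: guess a file's type from its first bytes.
# The signature list is deliberately short and is not the method used for the corpus.
MAGIC = [
    (b"%PDF-", ".pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", ".ole2"),  # legacy MS Office container (.doc/.xls/.ppt)
    (b"PK\x03\x04", ".zip"),
    (b"{\\rtf", ".rtf"),
    (b"\x1f\x8b", ".gz"),
    (b"\xffWPC", ".wp"),  # WordPerfect
]


def sniff_extension(head: bytes) -> str:
    """Return a guessed extension for the given leading bytes, or '.unk'."""
    for magic, ext in MAGIC:
        if head.startswith(magic):
            return ext
    # HTML has no fixed signature; look for a doctype or <html> tag.
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return ".html"
    return ".unk"
```

With this sketch, a file named with ‘.xls’ whose first bytes are `<html>` would be reported as ‘.html’, which is the kind of mismatch described above.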
We have gone through the corpus and created a shell script that renames the files so that their extensions match current usage. The script contains 115,135 lines. Of these, the following renames are implemented:
Rank Count Rename (old extension -> new extension):
============================
1 77227 .text -> .txt
2 9290 .xml -> .html
3 3683 .pdf -> .html
4 3565 . -> .html
5 2602 . -> .unk
6 2601 .xls -> .dbase3
7 2082 .text -> .unk
8 1943 . -> .pdf
9 1942 .text -> .html
10 1857 .doc -> .html
11 1088 .doc -> .rtf
12 620 .xls -> .html
13 595 .text -> .f
14 533 .text -> .xml
15 459 .ppt -> .html
16 438 .xls -> .txt
17 435 .doc -> .txt
18 346 .doc -> .wp
19 283 .txt -> .html
20 269 .eps -> .html
21 256 .log -> .html
22 253 .doc -> .unk
23 228 .swf -> .html
24 218 .xls -> .unk
25 179 .text -> .fits
26 175 .dwf -> .html
27 166 .gz -> .html
28 163 .sql -> .html
29 161 .text -> .tex
30 155 .html -> .xml
31 107 .html -> .pdf
32 96 .text -> .troff
33 94 .ps -> .html
34 70 .js -> .html
35 66 . -> .xml
36 60 .xls -> .gls
37 59 .ttf -> .txt
38 53 .text -> .sgml
39 45 .jpg -> .html
40 36 .ppt -> .txt
41 35 .csv -> .html
35 .ttf -> .html
43 30 .ppt -> .unk
44 29 .text -> .pdf
29 .xbm -> .txt
46 26 .java -> .html
26 .zip -> .html
48 25 .doc -> .fm
49 22 .text -> .rtf
50 21 .pub -> .html
51 20 .js -> .txt
52 17 .jar -> .html
17 .jar -> .txt
17 .text -> .gz
55 16 .ps -> .pdf
56 15 .ppt -> .doc
57 14 .text -> .swf
14 .tmp -> .html
14 .xbm -> .html
60 13 .doc -> .pdf
13 .doc -> .troff
62 9 .pps -> .html
9 .xlsx -> .html
64 8 .log -> .txt
65 7 . -> .rtf
7 .dll -> .html
7 .kml -> .html
7 .xls -> .wk1 (Lotus 1-2-3)
69 6 .doc -> .f
6 .kmz -> .html
6 .xml -> .txt
72 5 . -> .txt
5 .doc -> .sgml
5 .docx -> .html
5 .eps -> .pdf
5 .exe -> .html
5 .html -> .rtf
78 4 .doc -> .ileaf (Interleaf)
4 .ppt -> .zip
4 .pptx -> .html
4 .text -> .doc
4 .text -> .kml
4 .xls -> .zip
84 3 .bmp -> .html
3 .jpeg -> .html
3 .ppt -> .sgml
3 .text -> .wp
3 .tif -> .html
3 .xls -> .doc
3 .xls -> .xml
91 2 .exported -> .html
2 .ppt -> .appledouble (AppleDouble encoded Macintosh file)
2 .ppt -> .odp
2 .ppt -> .gd
2 .tmp -> .xml
2 .xls -> .123
2 .xls -> .lnk (MS Windows shortcut)
2 .xls -> .pdf
99 1 .csv -> .rtf
1 .doc -> .par
1 .doc -> .zip
1 .doc -> .fits
1 .doc -> .gz
1 .doc -> .icns
1 .doc -> .tex
1 .doc -> .xls
1 .doc -> .xml
1 .docx -> .pdf
1 .hlp -> .html
1 .hmtl -> .html
1 .html -> .gif
1 .html -> .kml
1 .kml -> .xml
1 .pdf -> .xml
1 .ppt -> .pdf
1 .sql -> .txt
1 .sys -> .rtf
1 .wp -> .pdf
1 .wp -> .rtf
1 .xls -> .wk3
1 .xls -> .bin (mc68020 pure executable)
1 .xls -> .f
1 .xls -> .sgml
1 .xml -> .kml
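A tally like the one above could be produced from the rename script with a few lines of Python. This is only a sketch. It assumes each rename is a line of the form `mv OLD NEW`, which the post does not confirm, and the function names `tally_renames` and `format_table` are hypothetical. As in the table, a file with no extension is shown as `.`, and tied rows after the first omit the rank.

```python
import os
from collections import Counter


def tally_renames(script_lines):
    """Count (old extension, new extension) pairs in lines shaped like 'mv OLD NEW'."""
    counts = Counter()
    for line in script_lines:
        parts = line.split()
        # Assumed line format; anything else (comments, blank lines) is skipped.
        if len(parts) != 3 or parts[0] != "mv":
            continue
        old_ext = os.path.splitext(parts[1])[1] or "."
        new_ext = os.path.splitext(parts[2])[1] or "."
        counts[(old_ext, new_ext)] += 1
    return counts


def format_table(counts):
    """Return rows sorted by descending count; tied rows after the first omit the rank."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    out = []
    prev = None
    for i, ((old, new), n) in enumerate(rows, start=1):
        rank = str(i) if n != prev else ""
        out.append(f"{rank:>4} {n:>6} {old} -> {new}")
        prev = n
    return out
```

For example, `format_table(tally_renames(open("rename.sh")))` would return the formatted rows for a script file named `rename.sh`, where that file name is also only a placeholder.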
We will remake the ZIP files over the next few days, and will replace the existing ZIP files and update the searchable database by 7 December 2010.