CC-MAIN-2021-31-PDF-UNTRUNCATED Metadata

Crawl provenance metadata

The table cc-provenance-20230303.csv.gz contains all provenance information from the crawl (8,410,704 rows, including the header).

  • url_id — primary key for each URL fetched or refetched
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • url — target URL extracted from Common Crawl’s index files. The longest URL in this set is 6,771 characters.
  • cc_digest — digest calculated by Common Crawl and extracted from the index files
  • cc_http_mime — MIME type as extracted from Common Crawl’s index files; this is derived from the HTTP header
  • cc_detected_mime — the detected MIME, as extracted from Common Crawl’s index files.
  • cc_warc_file_name — the Common Crawl warc file where the file’s individual warc file is stored
  • cc_warc_start — the offset within cc_warc_file_name where the individual warc file starts
  • cc_warc_end — the end of the individual warc file within the larger cc_warc_file_name
  • cc_truncated — Common Crawl’s code for why the file was truncated, if it was truncated. This information was extracted from Common Crawl’s indices. Values include:
    • '' (empty string) (6,383,873) — Common Crawl records this as not truncated
    • length (2,020,913) — the file was truncated because of length
    • disconnect (5,861) — there was a network disconnection during Common Crawl’s original fetch
    • time (56) — there was a timeout during Common Crawl’s original fetch
  • fetched_status — records our project’s status for obtaining the file. Values include:
    • ADDED_TO_REPOSITORY (6,377,619) — extracted directly from the Common Crawl data
    • REFETCHED_SUCCESS (1,922,505) — our project refetched content from the original target URL
    • REFETCH_UNHAPPY_HOST (53,038) — we tried to refetch a URL, but the failures from that host exceeded our threshold. (We didn’t want to bother a host that had refused our refetches.)
    • REFETCHED_IO_EXCEPTION_READING_ENTITY (45,561) — during our refetch, there was an IOException while trying to read the contents
    • EMPTY_PAYLOAD (5,719) — there was an empty payload in the Common Crawl warc file
    • REFETCHED_TIMEOUT (5,157) — there was a timeout during our attempted refetch
    • REFETCHED_IO_EXCEPTION (569) — there was a general IOException while we were trying to refetch
    • null (506) — ??
    • FETCHED_EXCEPTION_EMITTING (29) — there was an exception when we tried to write a refetched PDF to S3
  • fetched_digest — the sha256 that we calculated on the bytes that we have for the file, whether fetched from CC or refetched
  • fetched_length — the length in bytes of the file that we extracted from Common Crawl or refetched
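
The cc_warc_file_name, cc_warc_start and cc_warc_end columns can be combined into an HTTP byte-range request for a single record. The following is a minimal sketch that only builds the URL and the Range header. It rests on two assumptions that this document does not state: that Common Crawl files are served under https://data.commoncrawl.org/, and that cc_warc_end is the offset of the last byte of the record (pass end_inclusive=False if it turns out to be one past the end). The path in the example is hypothetical.

```python
# Assumption: public endpoint for Common Crawl data, not given in this document.
CC_BASE_URL = "https://data.commoncrawl.org/"


def warc_range_request(warc_file_name, start, end, end_inclusive=True):
    """Return (url, headers) for fetching one record out of a larger warc file.

    HTTP Range headers use inclusive byte positions, so an exclusive end
    offset has to be reduced by one.
    """
    start, end = int(start), int(end)
    last_byte = end if end_inclusive else end - 1
    if start < 0 or last_byte < start:
        raise ValueError(f"invalid byte range: {start}-{last_byte}")
    url = CC_BASE_URL + warc_file_name.lstrip("/")
    headers = {"Range": f"bytes={start}-{last_byte}"}
    return url, headers


# Hypothetical values standing in for one row of the provenance table.
url, headers = warc_range_request("crawl-data/example.warc.gz", "1000", "1999")
print(url)
print(headers)
```

The pair returned here can be handed to any HTTP client. The values are read from the CSV as strings, which is why the function converts them with int().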

mime                        count
application/pdf             8,156,384
application/octet-stream    145,722
text/html                   22,901
application/download        14,011
application/force-download  12,740
unk                         11,460
content-type:               7,153
pdf                         7,114
application/x-download      6,078
binary/octet-stream         2,166

Top 10 cc_http_mime values

mime                             count
application/pdf                  8,389,207
text/html                        16,515
text/plain                       3,049
application/xhtml+xml            814
application/pkcs7-signature      210
application/x-tika-ooxml         142
image/jpeg                       117
application/xml                  96
application/octet-stream         78
application/gzip                 76

Top 10 cc_detected_mime values
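
Value counts like the ones above can be recomputed by streaming the compressed table once. The sketch below uses only the Python standard library and assumes the file is a comma-separated, UTF-8 encoded CSV whose header row uses the column names listed earlier. The function name top_values is illustrative and not part of the dataset.

```python
import csv
import gzip
from collections import Counter


def top_values(csv_gz_path, column, n=10):
    """Return the n most common values of one column in a gzipped CSV."""
    counts = Counter()
    # Stream row by row so the whole table never has to fit in memory.
    with gzip.open(csv_gz_path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts.most_common(n)
```

For example, top_values("cc-provenance-20230303.csv.gz", "cc_detected_mime") would be expected to give counts comparable to the table above, provided the assumptions about the file layout hold.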

Hosts provenance metadata

The table cc-hosts-20230303.csv.gz contains information about the hosts and, where possible, the geographic location of the host for each PDF (8,410,704 rows, including the header). The columns include:

  • url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • host — host name of the URL
  • tld — top-level domain
  • ip_address — as retrieved from Common Crawl or captured during refetch
  • country, latitude and longitude — as geolocated by MaxMind’s GeoLite2

For the 8.3 million URLs for which we have a file, the counts for the top 10 countries are:

Country Code  Count
US            3,259,209
DE            896,990
FR            462,215
JP            364,303
GB            268,950
IT            228,065
NL            206,389
RU            176,947
CA            175,853
ES            173,619

Top 10 country codes
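
A count of this kind requires joining the hosts table to the provenance table on url_id. The sketch below is one way to do that with the standard library. It assumes that "URLs for which we have a file" means rows whose fetched_status is ADDED_TO_REPOSITORY or REFETCHED_SUCCESS; those two counts sum to 8,300,124, which is consistent with the 8.3 million figure, but the document does not state the filter explicitly. The helper names are illustrative.

```python
import csv
import gzip
from collections import Counter

# Assumption: these two statuses are the ones for which a file exists.
HAVE_FILE = {"ADDED_TO_REPOSITORY", "REFETCHED_SUCCESS"}


def read_rows(csv_gz_path):
    """Yield each row of a gzipped CSV as a dict keyed by the header names."""
    with gzip.open(csv_gz_path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def country_counts(provenance_rows, host_rows, n=10):
    """Count countries over the URLs whose provenance row has a file."""
    kept_ids = {
        row["url_id"]
        for row in provenance_rows
        if row["fetched_status"] in HAVE_FILE
    }
    counts = Counter(
        row["country"]
        for row in host_rows
        # Skip rows that were filtered out or could not be geolocated.
        if row["url_id"] in kept_ids and row["country"]
    )
    return counts.most_common(n)
```

With the real files this would be called as country_counts(read_rows("cc-provenance-20230303.csv.gz"), read_rows("cc-hosts-20230303.csv.gz")). Holding the set of kept url_id values in memory is a deliberate trade-off that avoids sorting or indexing either table.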

pdfinfo utility metadata

The pdfinfo-20230315.csv.gz contains output from pdfinfo (poppler version=23.03.0, data version=0.4.12). We ran this in a Docker container based on debian:bullseye-20230227-slim with the -isodates flag and a timeout of 2 minutes.

  • url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • parse_time_millis — milliseconds to process the file
  • exit_value — exit value for the pdfinfo process
  • timeout — boolean for whether or not the process timed out (exit_value = -1 in the 2 records where this happens)
  • stderr — stderr stream from pdfinfo (limited to first 1,024 characters)
  • pdf_version — PDF version from the header comment line at the start of the PDF file
  • creator — PDF creator tool from Document Information dictionary (limited to first 1,024 characters)
  • producer — PDF producer from Document Information dictionary (limited to first 1,024 characters)
  • created — date created from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
  • modified — date modified from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
  • custom_metadata — whether or not there is custom metadata (non-standard keys in the Document Information dictionary)
  • metadata_stream — whether or not there is an XMP Metadata stream (Document Catalog Metadata key)
  • tagged — whether the PDF is a Tagged PDF (Mark Information dictionary Marked key)
  • user_properties — contains user properties (Mark Information dictionary UserProperties key)
  • form — whether the PDF is a form: XFA, AcroForm, 'null' or '' (empty string)
  • javascript — whether the PDF contains JavaScript (ECMAScript)
  • pages — number of pages according to the PDF page tree
  • page_size — string representing page size of the first page (in pts, 1/72 inch)
  • page_rotation — the page rotation of the first page (raw, as specified by the Rotate key)
  • optimized — whether the PDF file is Linearized (a.k.a. “Fast web view” enabled)
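
The created and modified columns use two timestamp shapes, one with an hours-only offset such as +08 and one ending in Z. A minimal sketch for turning both into timezone-aware datetime objects follows. It normalizes the value first so that datetime.fromisoformat accepts it on older Python versions as well. The function name is illustrative, and values that do not match either shape are returned as None rather than guessed at.

```python
import re
from datetime import datetime, timezone

# Matches a full time of day followed by an hours-only offset, e.g. T17:42:51+08.
_HOURS_ONLY_OFFSET = re.compile(r"(T\d{2}:\d{2}:\d{2})([+-]\d{2})$")


def parse_pdfinfo_date(value):
    """Parse a created/modified value, or return None if it cannot be parsed."""
    if not value:
        return None
    text = value.strip()
    if text.endswith("Z"):
        # Spell out UTC as an explicit offset.
        text = text[:-1] + "+00:00"
    else:
        # Expand +08 to +08:00 so the offset has hours and minutes.
        text = _HOURS_ONLY_OFFSET.sub(r"\1\2:00", text)
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


example = parse_pdfinfo_date("2021-06-11T17:42:51+08")
print(example.astimezone(timezone.utc).isoformat())  # 2021-06-11T09:42:51+00:00
```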

Exit Value  Count      Notes
0           7,893,956  Completed normally
1           37,692     May not be a PDF file (21,837), Encrypted file (4,295), other problem
99          1,185      Wrong page range given (1,095) typically page tree has 0 pages?!
-1          2          timeout
1           null       0 byte file

pdfinfo exit values