CC-MAIN-2021-31-PDF-UNTRUNCATED Metadata

Crawl provenance metadata

The table cc-provenance-20230303.csv.gz contains all provenance information from the crawl (8,410,704 rows, including the header).

  • url_id — primary key for each URL fetched or refetched
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • url — target url extracted from Common Crawl’s index files. Max length in this set is 6,771 characters.
  • cc_digest — digest calculated by Common Crawl and extracted from the index files
  • cc_http_mime — MIME type as extracted from Common Crawl’s index files; this is derived from the HTTP header
  • cc_detected_mime — the detected MIME, as extracted from Common Crawl’s index files.
  • cc_warc_file_name — the Common Crawl warc file where the file’s individual warc file is stored
  • cc_warc_start — the offset within the cc_warc_file where the individual warc file is stored
  • cc_warc_end — this is the end of the individual warc file within the larger cc_warc_file
  • cc_truncated — Common Crawl’s code for why the file was truncated, if it was truncated. This information was extracted from Common Crawl’s indices. Values include:
    • '' (6,383,873) — empty string; Common Crawl records this as not truncated
    • length (2,020,913) — the file was truncated because of length
    • disconnect (5,861) — there was a network disconnection during Common Crawl’s original fetch
    • time (56) — there was a timeout during Common Crawl’s original fetch
  • fetched_status — records our project’s status for obtaining the file. Values include:
    • ADDED_TO_REPOSITORY (6,377,619) — extracted directly from the Common Crawl data
    • REFETCHED_SUCCESS (1,922,505) — our project refetched content from the original target URL
    • REFETCH_UNHAPPY_HOST (53,038) — we tried to refetch a URL, but the failures from that host exceeded our threshold. (We didn’t want to bother a host that had refused our refetches)
    • REFETCHED_IO_EXCEPTION_READING_ENTITY (45,561) — during our refetch, there was an IOException while trying to read the contents
    • EMPTY_PAYLOAD (5,719) — There was an empty payload in the Common Crawl warc file.
    • REFETCHED_TIMEOUT (5,157) — timeout during our attempted refetch.
    • REFETCHED_IO_EXCEPTION (569) — general IOException while we were trying to refetch.
    • null (506) — ??
    • FETCHED_EXCEPTION_EMITTING (29) — there was an exception when we tried to write a refetched PDF to S3
  • fetched_digest — the sha256 that we calculated on the bytes that we have for the file, whether fetched from CC or refetched
  • fetched_length — the length in bytes of the file that we extracted from Common Crawl or refetched

mime                            count
application/pdf             8,156,384
application/octet-stream      145,722
text/html                      22,901
application/download           14,011
application/force-download     12,740
unk                            11,460
content-type:                   7,153
pdf                             7,114
application/x-download          6,078
binary/octet-stream             2,166
Top 10 cc_http_mime values
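
Counts like the ones in this table can be recomputed by streaming the gzipped CSV. The following is a minimal sketch, not the project's own tooling. It assumes the file is comma-delimited UTF-8 and that its header row uses the column names listed above. The function name top_values and the sample rows are hypothetical.

```python
import csv
import gzip
import io
from collections import Counter


def top_values(csv_gz, column, n=10):
    """Return the n most common values of `column` in a gzipped CSV.

    `csv_gz` may be a path or a binary file object.
    ASSUMPTION: comma-delimited UTF-8 with a header row.
    """
    counts = Counter()
    # Decompress and decode lazily so the whole table is never held in memory.
    with gzip.open(csv_gz, mode="rt", encoding="utf-8", newline="") as text:
        for row in csv.DictReader(text):
            counts[row[column]] += 1
    return counts.most_common(n)


# Tiny in-memory stand-in for the real file (hypothetical rows).
sample = "url_id,cc_http_mime\n1,application/pdf\n2,application/pdf\n3,text/html\n"
sample_gz = io.BytesIO(gzip.compress(sample.encode("utf-8")))
print(top_values(sample_gz, "cc_http_mime"))
# → [('application/pdf', 2), ('text/html', 1)]
```

For the real table, the path cc-provenance-20230303.csv.gz would be passed in place of the in-memory sample.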

mime                             count
application/pdf              8,389,207
text/html                       16,515
text/plain                       3,049
application/xhtml+xml              814
application/pkcs7-signature        210
application/x-tika-ooxml           142
image/jpeg                         117
application/xml                     96
application/octet-stream            78
application/gzip                    76
Top 10 cc_detected_mime values
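
The cc_warc_file_name, cc_warc_start and cc_warc_end columns described above are enough to request a single record from Common Crawl with an HTTP range request. The sketch below only builds the request and rests on two assumptions that the table does not confirm: that the WARC files are served under https://data.commoncrawl.org/, and that cc_warc_end is the offset of the record's last byte (inclusive). The function name and the sample row are hypothetical.

```python
def warc_range_request(row, base_url="https://data.commoncrawl.org/"):
    """Build the URL and Range header for the record described by a provenance row."""
    start = int(row["cc_warc_start"])
    end = int(row["cc_warc_end"])
    if end < start:
        raise ValueError(f"cc_warc_end ({end}) precedes cc_warc_start ({start})")
    url = base_url + row["cc_warc_file_name"].lstrip("/")
    # HTTP byte ranges are inclusive on both ends.
    # ASSUMPTION: cc_warc_end points at the last byte of the record;
    # if it is one past the end instead, use end - 1 here.
    headers = {"Range": f"bytes={start}-{end}"}
    return url, headers


# Hypothetical row, for illustration only.
row = {
    "cc_warc_file_name": "crawl-data/EXAMPLE/example.warc.gz",
    "cc_warc_start": "1000",
    "cc_warc_end": "1999",
}
print(warc_range_request(row))
```

The returned pair can be handed to any HTTP client, for example urllib.request.Request(url, headers=headers). Common Crawl stores each record as its own gzip member, so the returned bytes should decompress on their own.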

Hosts provenance metadata

The table cc-hosts-20230303.csv.gz contains information about the hosts and, where possible, the geographic location of the host for each PDF (8,410,704 rows, including the header). The columns include:

  • url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • host — host
  • tld — top level domain
  • ip_address — as retrieved from Common Crawl or captured during refetch
  • country, latitude and longitude — as geolocated by MaxMind’s geolite2

Of the 8.3 million URLs for which we have a file, the counts for the top 10 countries are:

Country Code      Count
US            3,259,209
DE              896,990
FR              462,215
JP              364,303
GB              268,950
IT              228,065
NL              206,389
RU              176,947
CA              175,853
ES              173,619
Top 10 country codes
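
Because url_id is shared between the hosts and provenance tables, the two can be joined to restrict country counts to a given fetched_status. This is a minimal sketch under the assumption that the CSV headers match the column names listed above, including a column named country. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def country_counts_for_status(provenance_rows, host_rows, status):
    """Count host countries for URLs whose fetched_status equals `status`."""
    # Pass 1: collect the url_ids with the wanted status from the provenance table.
    wanted = {r["url_id"] for r in provenance_rows if r["fetched_status"] == status}
    # Pass 2: stream the hosts table and tally countries for those url_ids.
    return Counter(r["country"] for r in host_rows if r["url_id"] in wanted)


# Hypothetical rows standing in for the two real tables.
provenance = csv.DictReader(io.StringIO(
    "url_id,fetched_status\n"
    "1,ADDED_TO_REPOSITORY\n"
    "2,REFETCHED_SUCCESS\n"
    "3,REFETCH_UNHAPPY_HOST\n"
))
hosts = csv.DictReader(io.StringIO("url_id,country\n1,US\n2,DE\n3,US\n"))
print(country_counts_for_status(provenance, hosts, "REFETCHED_SUCCESS"))
# → Counter({'DE': 1})
```

Only the set of matching url_id values is kept in memory, so both tables can be read as streams.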

pdfinfo utility metadata

The table pdfinfo-20230315.csv.gz contains output from pdfinfo (poppler version=23.03.0, data version=0.4.12). We ran this in a Docker container based on debian:bullseye-20230227-slim with the -isodates flag and a timeout of 2 minutes.

  • url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • parse_time_millis — milliseconds to process the file
  • exit_value — exit value for the pdfinfo process
  • timeout — boolean for whether or not the process timed out (exit_value = -1 in the 2 records where this happens)
  • stderr — stderr stream from pdfinfo (limited to first 1,024 characters)
  • pdf_version — PDF version from the header comment line at the start of the PDF file
  • creator — PDF creator tool from Document Information dictionary (limited to first 1,024 characters)
  • producer — PDF producer from Document Information dictionary (limited to first 1,024 characters)
  • created — date created from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
  • modified — date modified from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
  • custom_metadata — whether or not there is custom metadata (non-standard keys in the Document Information dictionary)
  • metadata_stream — whether or not there is an XMP Metadata stream (Document Catalog Metadata key)
  • tagged — whether the PDF is a Tagged PDF (Mark Information dictionary Marked key)
  • user_properties — contains user properties (Mark Information dictionary UserProperties key)
  • form — PDF is a form: XFA, AcroForm, 'null' or '' (empty)
  • javascript — PDF contains JavaScript (ECMAScript)
  • pages — number of pages according to the PDF page tree
  • page_size — string representing page size of the first page (in pts, 1/72 inch)
  • page_rotation — the page rotation of the first page (raw, as specified by the Rotate key)
  • optimized — whether the PDF file is Linearized (a.k.a. “Fast web view” enabled)

Exit Value      Count  Notes
0           7,893,956  Completed normally
1              37,692  May not be a PDF file (21,837), Encrypted file (4,295), other problem
99              1,185  Wrong page range given (1,095) typically page tree has 0 pages?!
-1                  2  timeout
1                null  0 byte file
pdfinfo exit values
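
The created and modified values shown above use either a trailing Z or an hour-only offset such as +08, which older Python versions do not accept in datetime.fromisoformat. The sketch below normalizes both forms first. The function name is hypothetical, and treating an empty cell as "no date" is an assumption about how missing values are stored.

```python
import re
from datetime import datetime

# Matches a time followed by an hour-only UTC offset, e.g. "T17:42:51+08".
_HOUR_ONLY_OFFSET = re.compile(r"T\d{2}:\d{2}:\d{2}[+-]\d{2}$")


def parse_pdfinfo_date(value):
    """Parse a created/modified string into a datetime, or None if it is empty."""
    if not value:
        return None  # ASSUMPTION: missing dates are stored as empty cells
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"  # spell out UTC as an explicit offset
    elif _HOUR_ONLY_OFFSET.search(value):
        value += ":00"  # expand "+08" to "+08:00"
    return datetime.fromisoformat(value)


print(parse_pdfinfo_date("2021-06-11T17:42:51+08").isoformat())
# → 2021-06-11T17:42:51+08:00
print(parse_pdfinfo_date("2021-07-31T19:31:14Z").isoformat())
# → 2021-07-31T19:31:14+00:00
```

Values that do not follow either documented form raise ValueError, which callers may prefer to catch and count rather than discard silently.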

Apache Tika metadata — Overview

There are two Apache Tika metadata tables.

  • tika-20230714.csv.gz — this includes metadata extracted and/or calculated by Apache Tika on the primary container/input file. Each row represents the metadata for a given input file as fetched from a specific URL. As in the other tables, these tables are “URL” based, which means that an identical file (as calculated by SHA-256) may appear several times in the file.
  • tika-with-attachments-20230714.csv.gz — this includes metadata extracted and/or calculated by Apache Tika on the primary container/input files and on their attachments. Each row represents an input file or one of its attachments for a given URL.

We ran a development version of Tika between versions 2.8.0 and 2.8.1. We turned off Apache Tika’s integration with tesseract-ocr. We also turned off processing of images that were intended to be rendered.

Apache Tika metadata — Container file

tika-20230714.csv.gz

  • url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
  • file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
  • parse_status — options: OK, PARSE_EXCEPTION, TIMEOUT, OOM, UNSPECIFIED_CRASH
  • parse_time_millis — milliseconds to process the file
  • mime — the file type as identified by Apache Tika
  • macro_count — the number of macros/javascript files in the container file. This does not include counts of macros embedded within embedded files.
  • attachment_count — the number of attachments in the container file. An attachment is a file that was attached to/embedded in the container file and is intended to exist as a standalone file. These counts exclude image files, inline images, font files, ICC profiles and any other embedded files that are used for the rendering or functionality of the container file. Tika looks for “attachments” in PDFs by looking for the /FileSpec keys in the PDF.
  • created — date created (XMP is preferred over the standard metadata dictionary if both exist).
  • modified — date modified (XMP is preferred over the standard metadata dictionary if both exist).
  • encrypted — whether the file is encrypted or not
  • has_xfa — whether the PDF has XFA
  • has_xmp — whether the PDF has XMP
  • has_collection — whether the PDF has a collection and is a portfolio PDF
  • has_marked_content — whether the PDF has marked content
  • num_pages — the number of pages
  • xmp_creator_tool — the creator tool (XMP is preferred over the standard metadata dictionary if both exist)
  • pdf_producer — the producer
  • pdf_version — the PDF version as identified by PDFBox
  • pdfa_version — the PDF/A version if this file identifies as a PDF/A
  • pdfuaid_part — the PDF/UA id part if the file identifies as a PDF/UA
  • pdfx_conformance — the PDF/X conformance if the file identifies as a PDF/X
  • pdfx_version — the PDF/X version if the file identifies as PDF/X
  • pdfxid_version — the PDF/X id if the file identifies as PDF/X
  • pdfvt_version — the PDF/VT version if the file identifies as PDF/VT
  • pdf_num_3d_annotations — the number of 3D annotations in the container file
  • pdf_has_acroform_fields — whether the PDF has AcroForm fields
  • pdf_incremental_updates — the number of incremental updates as counted by Apache Tika’s rough heuristic of scanning for startxref and %%EOF
  • pdf_overall_unmapped_unicode_chars — the percentage of characters extracted from the PDF that do not have Unicode mappings.
  • pdf_contains_damaged_font — whether PDFBox identifies a damaged font
  • pdf_contains_non_embedded_font — whether PDFBox identifies a non-embedded font
  • has_signature — whether the file has a digital signature. This can be true of PDFs and MSOffice files.
  • location — latitude,longitude when extracted from the metadata of a file (e.g. EXIF metadata); applies to embedded files, not as much to container files that are PDFs
  • tika_eval_num_tokens — the number of tokens (words) that were counted in the extracted text by the tika-eval module
  • tika_eval_num_alpha_tokens — the number of alphabetic tokens (words) that were counted in the extracted text by the tika-eval module
  • tika_eval_lang — the language as identified by tika-eval’s language detector on the extracted text (statistical classifier based on character frequencies)
  • tika_eval_oov — the out of vocabulary statistic as calculated by tika-eval. After running language identification on the extracted text, the tika-eval module counts how many words in the extracted text were in the top 20k most common words for the identified language. When there are enough tokens (> 100) and this value is high, that may indicate that the extracted text is garbled.
  • container_exception — the stacktrace if there was a parse exception on the file
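
The tika_eval_oov description above suggests a simple filter for possibly garbled text: require more than 100 tokens, then flag rows with a high out-of-vocabulary value. The sketch below is one way to express that rule. The 0.9 cut-off is an arbitrary illustrative choice, the code assumes the column holds a fraction between 0 and 1, and it assumes that empty cells mean no statistic was computed. None of these details are stated by the table description.

```python
def looks_garbled(num_tokens, oov, min_tokens=100, oov_threshold=0.9):
    """Apply the tika_eval_oov heuristic to one row.

    Returns True/False when there is enough text to judge, else None.
    ASSUMPTIONS: oov is a fraction in [0, 1]; 0.9 is only an example cut-off;
    empty cells mean tika-eval produced no statistic.
    """
    if num_tokens in ("", None) or oov in ("", None):
        return None
    if int(num_tokens) <= min_tokens:
        return None  # too few tokens for the statistic to be meaningful
    return float(oov) >= oov_threshold


# Values arrive as strings when read with the csv module.
print(looks_garbled("2500", "0.97"))  # → True
print(looks_garbled("2500", "0.12"))  # → False
print(looks_garbled("40", "0.97"))    # → None
```

Returning None rather than False for short texts keeps "cannot tell" separate from "looks fine" in later counts.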

Apache Tika metadata — Container file with Attachments

tika-with-attachments-20230714.csv.gz

  • attachment_num — if the file is an attachment, this is the attachment number within the primary container file
  • emb_depth — the embedded depth of the attachment (if an attachment)
  • embedded_id — a unique id for the embedded file
  • embedded_id_path — the path based on the unique ids for the embedded file. For example if a PDF has an attached MSG file (id=1) that itself has an attached DOCX file (id=2), then the path for the DOCX file would be /1/2
  • embedded_resource_type — whether this was an ATTACHMENT or a MACRO
  • embedded_exception — the stacktrace on the embedded file if there was a catchable parse exception thrown during the processing of the embedded file
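
The embedded_id_path format described above (for example /1/2) can be split into its component ids to recover an attachment's parent. This is a minimal sketch that assumes the ids are integers, as in the example, and that the path is empty for the container file itself. Both function names are hypothetical.

```python
def parse_embedded_id_path(path):
    """Split an embedded_id_path such as '/1/2' into a list of integer ids."""
    # ASSUMPTION: ids are integers and an empty path denotes the container file.
    return [int(part) for part in path.split("/") if part]


def parent_embedded_id(path):
    """Return the id of the enclosing embedded file, or None if there is none."""
    ids = parse_embedded_id_path(path)
    return ids[-2] if len(ids) >= 2 else None


print(parse_embedded_id_path("/1/2"))  # → [1, 2]
print(parent_embedded_id("/1/2"))      # → 1
print(parent_embedded_id("/1"))        # → None
```

In the document's own example, the DOCX file with path /1/2 has the MSG file (id=1) as its parent, while the MSG file has no embedded parent because it is attached directly to the PDF.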