CC-MAIN-2021-31-PDF-UNTRUNCATED Metadata

Crawl provenance metadata

The table cc-provenance-20230303.csv.gz contains all provenance information from the crawl (8,410,704 rows, including the header).

url_id — primary key for each URL fetched or refetched
file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
url — target url extracted from Common Crawl’s index files. Max length in this set is 6,771 characters.
cc_digest — digest calculated by Common Crawl and extracted from the index files
cc_http_mime — MIME as extracted from Common Crawl’s index files — this is derived from the http header
cc_detected_mime — the detected MIME, as extracted from Common Crawl’s index files.
cc_warc_file_name — the Common Crawl warc file where the file’s individual warc file is stored
cc_warc_start — the offset within the cc_warc_file where the individual warc file is stored
cc_warc_end — this is the end of the individual warc file within the larger cc_warc_file
cc_truncated — Common Crawl’s code for why the file was truncated, if it was truncated. This information was extracted from Common Crawl’s indices. Values include:
- '' (6,383,873) — (empty string) — Common Crawl records this as not truncated
- length (2,020,913) — the file was truncated because of length
- disconnect (5,861) — there was a network disconnection during Common Crawl’s original fetch
- time (56) — there was a timeout during Common Crawl’s original fetch
fetched_status — records our project’s status for obtaining the file. Values include:
- ADDED_TO_REPOSITORY (6,377,619) — extracted directly from the Common Crawl data
- REFETCHED_SUCCESS (1,922,505) — our project refetched content from the original target URL
- REFETCH_UNHAPPY_HOST (53,038) — we tried to refetch a URL, but the failures from that host exceeded our threshold. (We didn’t want to bother a host that had refused our refetches.)
- REFETCHED_IO_EXCEPTION_READING_ENTITY (45,561) — during our refetch, there was an IOException while trying to read the contents
- EMPTY_PAYLOAD (5,719) — There was an empty payload in the Common Crawl warc file.
- REFETCHED_TIMEOUT (5,157) — timeout during our attempted refetch.
- REFETCHED_IO_EXCEPTION (569) — general IOException while we were trying to refetch.
- null (506) — ??
- FETCHED_EXCEPTION_EMITTING (29) — there was an exception when we tried to write a refetched PDF to S3
fetched_digest — the sha256 that we calculated on the bytes that we have for the file, whether fetched from CC or refetched
fetched_length — the length in bytes of the file that we extracted from Common Crawl or refetched
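
The cc_warc_file_name, cc_warc_start and cc_warc_end columns are enough to request a single record from Common Crawl with an HTTP range request. The sketch below only builds the URL and the Range header. It assumes that cc_warc_end is the offset one past the last byte (the description above does not say whether the end is inclusive, so there is an end_inclusive flag) and that the files are served from https://data.commoncrawl.org/. The function name, the flag and the example path are hypothetical.

```python
CC_BASE_URL = "https://data.commoncrawl.org/"  # assumed public host for Common Crawl data


def warc_record_request(warc_file_name, start, end, end_inclusive=False):
    """Return (url, headers) for fetching one record from a larger WARC file.

    HTTP byte ranges are inclusive on both ends, so an exclusive end offset
    is reduced by one before it goes into the Range header.
    """
    start, end = int(start), int(end)
    last_byte = end if end_inclusive else end - 1
    if start < 0 or last_byte < start:
        raise ValueError(f"invalid offsets: start={start}, end={end}")
    url = CC_BASE_URL + warc_file_name.lstrip("/")
    return url, {"Range": f"bytes={start}-{last_byte}"}
```

For example, warc_record_request("crawl-data/example.warc.gz", 100, 250) returns the full URL together with the header Range: bytes=100-249. If the retrieved bytes do not form a complete record, the end_inclusive assumption is the first thing to check.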

mime	count
application/pdf	8,156,384
application/octet-stream	145,722
text/html	22,901
application/download	14,011
application/force-download	12,740
unk	11,460
content-type:	7,153
pdf	7,114
application/x-download	6,078
binary/octet-stream	2,166

Top 10 cc_http_mime values

mime	count
application/pdf	8,389,207
text/html	16,515
text/plain	3,049
application/xhtml+xml	814
application/pkcs7-signature	210
application/x-tika-ooxml	142
image/jpeg	117
application/xml	96
application/octet-stream	78
application/gzip	76

Top 10 cc_detected_mime values
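
Tallies like the two tables above can be recomputed from the provenance table. The following is a minimal sketch, assuming the file is a comma-delimited, UTF-8 CSV with a header row; the helper names read_rows and top_values are hypothetical.

```python
import csv
import gzip
from collections import Counter


def read_rows(path):
    """Yield each row of a gzip-compressed CSV file as a dict keyed by header."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def top_values(rows, column, n=10):
    """Count the values of one column and return the n most common."""
    counts = Counter(row[column] for row in rows)
    return counts.most_common(n)
```

A call such as top_values(read_rows("cc-provenance-20230303.csv.gz"), "cc_detected_mime") streams the file once and keeps only the counter in memory, which matters for a table with more than 8 million rows.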

Hosts provenance metadata

The cc-hosts-20230303.csv.gz contains information about the hosts and, where possible, the geographic location of the host for each PDF (8,410,704 rows, including the header). The columns include:

url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
host — host
tld — top level domain
ip_address — as retrieved from Common Crawl or captured during refetch
country, latitude and longitude — as geolocated by MaxMind’s geolite2

Of the 8.3 million URLs for which we have a file, the counts for the top 10 countries are:

Country Code	Count
US	3,259,209
DE	896,990
FR	462,215
JP	364,303
GB	268,950
IT	228,065
NL	206,389
RU	176,947
CA	175,853
ES	173,619

Top 10 country codes
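
A country tally restricted to URLs with a file requires joining the hosts table to the provenance table on url_id. The sketch below assumes that "have a file" means a fetched_status of ADDED_TO_REPOSITORY or REFETCHED_SUCCESS (their counts in the provenance section sum to 8,300,124, which is consistent with the 8.3 million figure, but this reading is an assumption) and that the country column is named country. The function and constant names are hypothetical.

```python
from collections import Counter

# Assumption: these two statuses are the ones that mean "we have a file".
HAVE_FILE_STATUSES = {"ADDED_TO_REPOSITORY", "REFETCHED_SUCCESS"}


def country_counts(provenance_rows, host_rows, n=10):
    """Join the two tables on url_id and count countries for URLs with a file."""
    have_file = {
        row["url_id"]
        for row in provenance_rows
        if row["fetched_status"] in HAVE_FILE_STATUSES
    }
    counts = Counter(
        row["country"] for row in host_rows if row["url_id"] in have_file
    )
    return counts.most_common(n)
```

Both arguments can be any iterable of dict rows, for example csv.DictReader objects over the two decompressed files. Rows without a geolocated country are counted under whatever placeholder the table uses for missing values.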

`pdfinfo` utility metadata

The pdfinfo-20230315.csv.gz contains output from pdfinfo (poppler version=23.03.0, data version=0.4.12). We ran this in a Docker container based on debian:bullseye-20230227-slim with the -isodates flag and a timeout of 2 minutes.

url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
parse_time_millis — milliseconds to process the file
exit_value — exit value for the pdfinfo process
timeout — boolean for whether or not the process timed out (exit_value= -1 in the 2 records where this happens)
stderr — stderr stream from pdfinfo (limited to first 1,024 characters)
pdf_version — PDF version from the header comment line at the start of the PDF file
creator — PDF creator tool from Document Information dictionary (limited to first 1,024 characters)
producer — PDF producer from Document Information dictionary (limited to first 1,024 characters)
created — date created from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
modified — date modified from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
custom_metadata — whether or not there is custom metadata (non-standard keys in the Document Information dictionary)
metadata_stream — whether or not there is an XMP Metadata stream (Document Catalog Metadata key)
tagged — whether the PDF is a Tagged PDF (Mark Information dictionary Marked key)
user_properties — contains user properties (Mark Information dictionary UserProperties key)
form — PDF is a form: XFA, AcroForm, ‘null’ or '' (empty)
javascript — PDF contains JavaScript (ECMAscript)
pages — number of pages according to the PDF page tree
page_size — string representing page size of the first page (in pts, 1/72 inch)
page_rotation — the page rotation of the first page (raw, as specified by the Rotate key)
optimized — whether the PDF file is Linearized (a.k.a. “Fast web view” enabled)
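
The created and modified values use an hour-only offset (+08) or Z, which not every date parser accepts. Below is a minimal sketch of a tolerant parser, assuming the values follow the two formats shown above, optionally with a minutes part in the offset or with no offset at all; the function name parse_pdfinfo_date is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Timestamp, then an optional zone: "Z", "+08", "+0800" or "+08:00".
_ISO_STAMP = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(Z|[+-]\d{2}(?::?\d{2})?)?"
)


def parse_pdfinfo_date(value):
    """Parse a created/modified value; return None if it does not match."""
    match = _ISO_STAMP.fullmatch(value.strip())
    if match is None:
        return None
    stamp, zone = match.groups()
    try:
        parsed = datetime.fromisoformat(stamp)
        if zone is None:
            return parsed  # no offset given: keep the datetime naive
        if zone == "Z":
            return parsed.replace(tzinfo=timezone.utc)
        sign = 1 if zone[0] == "+" else -1
        digits = zone[1:].replace(":", "")
        offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:] or 0))
        return parsed.replace(tzinfo=timezone(sign * offset))
    except ValueError:
        # Out-of-range date parts or offsets are treated as unparseable.
        return None
```

Returning None instead of raising keeps a bulk pass over millions of rows from stopping on the first malformed date; callers that need to audit bad values can count the None results.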

Exit Value	Count	Notes
0	7,893,956	Completed normally
1	37,692	May not be a PDF file (21,837), Encrypted file (4,295), other problem
99	1,185	Wrong page range given (1,095) typically page tree has 0 pages?!
-1	2	timeout
null	1	0 byte file

pdfinfo exit values

`Apache Tika` metadata — Overview

There are two Apache Tika metadata tables.

tika-20230714.csv.gz — this includes metadata extracted and/or calculated by Apache Tika on the primary container/input file. Each row represents the metadata for a given input file as fetched from a specific URL. As in the other tables, these tables are “URL” based, which means that an identical file (as calculated by SHA-256) may appear several times in the table.
tika-with-attachments-20230714.csv.gz — this includes metadata extracted and/or calculated by Apache Tika on the primary container/input files and on their attachments. Each row represents an input file or one of its attachments for a given URL.

We ran a development version of Tika between versions 2.8.0 and 2.8.1. We turned off Apache Tika’s integration with tesseract-ocr. We also turned off processing of images that were intended to be rendered.

Apache Tika metadata — Container file

tika-20230714.csv.gz

url_id — primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
file_name — name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
parse_status — options: OK, PARSE_EXCEPTION, TIMEOUT, OOM, UNSPECIFIED_CRASH
parse_time_millis — milliseconds to process the file
mime — the file type as identified by Apache Tika
macro_count — the number of macros/javascript files in the container file. This does not include counts of macros embedded within embedded files.
attachment_count — the number of attachments in the container file. An attachment is a file that was attached to/embedded in the container file and is intended to exist as a standalone file. These counts do not include image files, inline images, font files, ICC profiles or any other embedded files that are used for the rendering or functionality of the container file. Tika finds “attachments” in PDFs by looking for the /FileSpec keys in the PDF.
created — date created (XMP is preferred over the standard metadata dictionary if both exist).
modified — date modified (XMP is preferred over the standard metadata dictionary if both exist).
encrypted — whether the file is encrypted or not
has_xfa — whether the PDF has XFA
has_xmp — whether the PDF has XMP
has_collection — whether the PDF has a collection and is a portfolio PDF
has_marked_content — whether the PDF has marked content
num_pages — the number of pages
xmp_creator_tool — the creator tool (XMP is preferred over the standard metadata dictionary if both exist)
pdf_producer — the producer
pdf_version — the PDF version as identified by PDFBox
pdfa_version — the PDF/A version if this file identifies as a PDF/A
pdfuaid_part — the PDF/UA id part if the file identifies as a PDF/UA
pdfx_conformance — the PDF/X conformance if the file identifies as a PDF/X
pdfx_version — the PDF/X version if the file identifies as PDF/X
pdfxid_version — the PDF/X id if the file identifies as PDF/X
pdfvt_version — the PDF/VT version if the file identifies as PDF/VT
pdf_num_3d_annotations — the number of 3D annotations in the container file
pdf_has_acroform_fields — whether the PDF has AcroForm fields
pdf_incremental_updates — the number of incremental updates as counted by Apache Tika’s rough heuristic of scanning for startxref and %%EOF
pdf_overall_unmapped_unicode_chars — the percentage of characters extracted from the PDF that do not have Unicode mappings.
pdf_contains_damaged_font — whether PDFBox identifies a damaged font
pdf_contains_non_embedded_font — whether PDFBox identifies a non-embedded font
has_signature — whether the file has a digital signature. This can be true of PDFs and MSOffice files.
location – latitude,longitude when extracted from the metadata of a file (e.g. EXIF metadata); applies to embedded files, not as much to container files that are PDFs
tika_eval_num_tokens — the number of tokens (words) that were counted in the extracted text by the tika-eval module
tika_eval_num_alpha_tokens — the number of alphabetic tokens (words) that were counted in the extracted text by the tika-eval module
tika_eval_lang — the language as identified by tika-eval’s language detector on the extracted text (statistical classifier based on character frequencies)
tika_eval_oov — the out-of-vocabulary statistic as calculated by tika-eval. After running language identification on the extracted text, the tika-eval module counts how many words in the extracted text were in the top 20k most common words for the identified language; the out-of-vocabulary value reflects the words that were not. When there are enough tokens (> 100) and this value is high, that may indicate that the extracted text is garbled.
container_exception — the stacktrace if there was a parse exception on the file
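
The tika_eval_oov guidance above (enough tokens and a high value) can be turned into a simple row filter. This is a sketch only: it assumes tika_eval_oov is stored as a fraction between 0 and 1, and the 0.8 cut-off is an arbitrary illustrative threshold, not a value recommended by the dataset. The function name looks_garbled is hypothetical.

```python
def looks_garbled(row, min_tokens=100, oov_threshold=0.8):
    """Return True when a row has enough tokens and a high OOV value.

    Returns False when either field is missing, empty or not numeric.
    """
    try:
        tokens = int(row["tika_eval_num_tokens"])
        oov = float(row["tika_eval_oov"])
    except (KeyError, TypeError, ValueError):
        return False
    return tokens > min_tokens and oov >= oov_threshold
```

If the column turns out to hold percentages rather than fractions, the threshold would need to be scaled accordingly. In either case the result is a hint for manual review rather than a verdict on the file.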

Apache Tika metadata — Container file with Attachments

tika-with-attachments-20230714.csv.gz

attachment_num — if the file is an attachment, this is the attachment number within the primary container file
emb_depth — the embedded depth of the attachment (if an attachment)
embedded_id — a unique id for the embedded file
embedded_id_path — the path based on the unique ids for the embedded file. For example if a PDF has an attached MSG file (id=1) that itself has an attached DOCX file (id=2), then the path for the DOCX file would be /1/2
embedded_resource_type — whether this was an ATTACHMENT or a MACRO
embedded_exception — the stacktrace on the embedded file if there was a catchable parse exception thrown during the processing of the embedded file
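
The embedded_id_path column encodes the nesting of attachments, as in the /1/2 example above. A minimal sketch of splitting that path, assuming the ids are integers as in that example; both function names are hypothetical.

```python
def parse_embedded_id_path(path):
    """Split an embedded_id_path such as "/1/2" into a list of integer ids."""
    return [int(part) for part in path.split("/") if part]


def parent_embedded_id(path):
    """Return the id of the direct parent, or None for a first-level attachment."""
    ids = parse_embedded_id_path(path)
    return ids[-2] if len(ids) >= 2 else None
```

For the DOCX file in the example, parse_embedded_id_path("/1/2") gives [1, 2] and parent_embedded_id("/1/2") gives 1, the id of the MSG file that contains it.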