Crawl provenance metadata
The table cc-provenance-20230303.csv.gz contains all provenance information from the crawl (8,410,704 rows, including the header).
- url_id: primary key for each URL fetched or refetched
- file_name: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- url: target URL extracted from Common Crawl’s index files. Max length in this set is 6,771 characters.
- cc_digest: digest calculated by Common Crawl and extracted from the index files
- cc_http_mime: MIME as extracted from Common Crawl’s index files; this is derived from the HTTP header
- cc_detected_mime: the detected MIME, as extracted from Common Crawl’s index files
- cc_warc_file_name: the Common Crawl warc file where the file’s individual warc file is stored
- cc_warc_start: the offset within the cc_warc_file where the individual warc file is stored
- cc_warc_end: the end of the individual warc file within the larger cc_warc_file
- cc_truncated: Common Crawl’s code for why the file was truncated, if it was truncated. This information was extracted from Common Crawl’s indices. Values include:
  - '' (6,383,873): empty string; Common Crawl records this as not truncated
  - length (2,020,913): the file was truncated because of length
  - disconnect (5,861): there was a network disconnection during Common Crawl’s original fetch
  - time (56): there was a timeout during Common Crawl’s original fetch
- fetched_status: records our project’s status for obtaining the file. Values include:
  - ADDED_TO_REPOSITORY (6,377,619): extracted directly from the Common Crawl data
  - REFETCHED_SUCCESS (1,922,505): our project refetched content from the original target URL
  - REFETCH_UNHAPPY_HOST (53,038): we tried to refetch a URL, but the failures from that host exceeded our threshold. (We didn’t want to bother a host that had refused our refetches.)
  - REFETCHED_IO_EXCEPTION_READING_ENTITY (45,561): during our refetch, there was an IOException while trying to read the contents
  - EMPTY_PAYLOAD (5,719): there was an empty payload in the Common Crawl warc file
  - REFETCHED_TIMEOUT (5,157): timeout during our attempted refetch
  - REFETCHED_IO_EXCEPTION (569): general IOException while we were trying to refetch
  - null (506): ??
  - FETCHED_EXCEPTION_EMITTING (29): there was an exception when we tried to write a refetched PDF to S3
- fetched_digest: the sha256 that we calculated on the bytes that we have for the file, whether fetched from CC or refetched
- fetched_length: the length in bytes of the file that we extracted from Common Crawl or refetched
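
The per-value counts listed for cc_truncated and fetched_status should be reproducible with a short tally over the table. The following is a minimal sketch using only the Python standard library. It assumes the file is UTF-8 encoded and uses standard CSV quoting, neither of which is stated above; the helper names read_rows and count_values are hypothetical.

```python
import csv
import gzip
from collections import Counter


def read_rows(path):
    """Yield each row of a gzipped CSV file as a dict keyed by the header.

    Assumption: the file is UTF-8 encoded with standard CSV quoting.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def count_values(rows, column):
    """Tally how often each value of `column` occurs in an iterable of dict rows."""
    return Counter(row[column] for row in rows)
```

For example, `count_values(read_rows("cc-provenance-20230303.csv.gz"), "fetched_status")` streams the table once without loading it all into memory. How the null status is represented in the CSV (empty string or a literal string) would need to be checked against the data.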
cc_http_mime values:
mime | count |
---|---|
application/pdf | 8,156,384 |
application/octet-stream | 145,722 |
text/html | 22,901 |
application/download | 14,011 |
application/force-download | 12,740 |
unk | 11,460 |
content-type: | 7,153 |
(empty string) | 7,114 |
application/x-download | 6,078 |
binary/octet-stream | 2,166 |
cc_detected_mime values:
mime | count |
---|---|
application/pdf | 8,389,207 |
text/html | 16,515 |
text/plain | 3,049 |
application/xhtml+xml | 814 |
application/pkcs7-signature | 210 |
application/x-tika-ooxml | 142 |
image/jpeg | 117 |
application/xml | 96 |
application/octet-stream | 78 |
application/gzip | 76 |
Hosts provenance metadata
The cc-hosts-20230303.csv.gz table contains information about the hosts and, where possible, the geographic location of the host for each PDF (8,410,704 rows, including the header). The columns include:
- url_id: primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
- file_name: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- host: host
- tld: top level domain
- ip_address: as retrieved from Common Crawl or captured during refetch
- country, latitude and longitude: as geolocated by MaxMind’s geolite2
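
Because url_id is shared with the provenance table, rows from the two tables can be combined on that key. The sketch below shows one way to do an inner join over rows already read as dicts (for instance with csv.DictReader). The function name join_on_url_id is hypothetical, and it assumes url_id values are represented identically in both tables.

```python
def join_on_url_id(provenance_rows, host_rows):
    """Inner-join two iterables of dict rows on their shared url_id column.

    Host columns are merged into the matching provenance row. Columns that
    appear in both tables (such as file_name) take the host table's value.
    """
    hosts_by_id = {row["url_id"]: row for row in host_rows}
    joined = []
    for row in provenance_rows:
        host = hosts_by_id.get(row["url_id"])
        if host is not None:
            joined.append({**row, **host})
    return joined
```

Indexing the full hosts table in a dict holds roughly 8.4 million rows in memory, so for the complete dataset a database or a dataframe library may be a better fit than this sketch.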
Of the 8.3 million URLs for which we have a file, the counts for the top 10 countries are:
Country Code | Count |
---|---|
US | 3,259,209 |
DE | 896,990 |
FR | 462,215 |
JP | 364,303 |
GB | 268,950 |
IT | 228,065 |
NL | 206,389 |
RU | 176,947 |
CA | 175,853 |
ES | 173,619 |
pdfinfo utility metadata
The pdfinfo-20230315.csv.gz table contains output from pdfinfo (poppler version=23.03.0, data version=0.4.12). We ran this in a Docker container based on debian:bullseye-20230227-slim with the -isodates flag and a timeout of 2 minutes.
- url_id: primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
- file_name: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- parse_time_millis: milliseconds to process the file
- exit_value: exit value for the pdfinfo process
- timeout: boolean for whether or not the process timed out (exit_value = -1 in the 2 records where this happens)
- stderr: stderr stream from pdfinfo (limited to first 1,024 characters)
- pdf_version: PDF version from the header comment line at the start of the PDF file
- creator: PDF creator tool from Document Information dictionary (limited to first 1,024 characters)
- producer: PDF producer from Document Information dictionary (limited to first 1,024 characters)
- created: date created from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
- modified: date modified from Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
- custom_metadata: whether or not there is custom metadata (non-standard keys in the Document Information dictionary)
- metadata_stream: whether or not there is an XMP Metadata stream (Document Catalog Metadata key)
- tagged: whether the PDF is a Tagged PDF (Mark Information dictionary Marked key)
- user_properties: whether the PDF contains user properties (Mark Information dictionary UserProperties key)
- form: whether the PDF is a form; values are XFA, AcroForm, 'null' or '' (empty)
- javascript: whether the PDF contains JavaScript (ECMAScript)
- pages: number of pages according to the PDF page tree
- page_size: string representing page size of the first page (in pts, 1/72 inch)
- page_rotation: the page rotation of the first page (raw, as specified by the Rotate key)
- optimized: whether the PDF file is linearized (a.k.a. “Fast web view” enabled)
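
The created and modified values use an hour-only UTC offset (+08) or a Z suffix, which older Python versions do not accept in datetime.fromisoformat. The following is a small sketch of one way to normalize both documented forms before parsing. The function name parse_pdfinfo_date is hypothetical, and the handling of empty cells is an assumption about how missing dates appear in the CSV.

```python
import re
from datetime import datetime

# Matches a timestamp that ends in an hour-only offset, e.g. 2021-06-11T17:42:51+08
_HOUR_ONLY_OFFSET = re.compile(r"T\d{2}:\d{2}:\d{2}[+-]\d{2}$")


def parse_pdfinfo_date(value):
    """Parse a created/modified value into a datetime.

    Returns None for an empty value (assumption: missing dates are empty).
    Raises ValueError for any other string that is not in a supported form.
    """
    if not value:
        return None
    text = value.strip()
    if text.endswith("Z"):
        # Spell out UTC so that older fromisoformat implementations accept it.
        text = text[:-1] + "+00:00"
    elif _HOUR_ONLY_OFFSET.search(text):
        # Expand +08 to +08:00.
        text += ":00"
    return datetime.fromisoformat(text)
```

Values that pdfinfo could not interpret as dates may not match either form, so callers will likely want to catch ValueError rather than assume every row parses.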
pdfinfo exit values:
Exit Value | Count | Notes |
---|---|---|
0 | 7,893,956 | Completed normally |
1 | 37,692 | May not be a PDF file (21,837), Encrypted file (4,295), other problem |
99 | 1,185 | Wrong page range given (1,095) typically page tree has 0 pages?! |
-1 | 2 | timeout |
1 | null | 0 byte file |
Apache Tika metadata: Overview
There are two Apache Tika metadata tables.
- tika-20230714.csv.gz: this includes metadata extracted and/or calculated by Apache Tika on the primary container/input file. Each row represents the metadata for a given input file as fetched from a specific URL. As in the other tables, these tables are “URL” based, which means that an identical file (as calculated by SHA-256) may appear several times in the file.
- tika-with-attachments-20230714.csv.gz: this includes metadata extracted and/or calculated by Apache Tika on the primary container/input files and on their attachments. Each row represents an input file or its attachment(s) for a given URL.
We ran a development version of Tika between versions 2.8.0 and 2.8.1. We turned off Apache Tika’s integration with tesseract-ocr. We also turned off processing of images that were intended to be rendered.
Apache Tika metadata: Container file
tika-20230714.csv.gz
- url_id: primary key for each URL fetched or refetched. This key can be joined with the url_id in the cc-provenance-20230303.csv.gz table.
- file_name: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- parse_status: one of OK, PARSE_EXCEPTION, TIMEOUT, OOM, UNSPECIFIED_CRASH
- parse_time_millis: milliseconds to process the file
- mime: the file type as identified by Apache Tika
- macro_count: the number of macros/javascript files in the container file. This does not include counts of macros embedded within embedded files.
- attachment_count: the number of attachments in the container file. An attachment is a file that was attached to/embedded in the container file and is intended to exist as a standalone file. These counts do not include image files, inline images, font files, ICC profiles or any other embedded files that are used for the rendering or functionality of the container file. Tika looks for “attachments” in PDFs by looking for the /FileSpec keys in the PDF.
- created: date created (XMP is preferred over the standard metadata dictionary if both exist)
- modified: date modified (XMP is preferred over the standard metadata dictionary if both exist)
- encrypted: whether the file is encrypted or not
- has_xfa: whether the PDF has XFA
- has_xmp: whether the PDF has XMP
- has_collection: whether the PDF has a collection and is a portfolio PDF
- has_marked_content: whether the PDF has marked content
- num_pages: the number of pages
- xmp_creator_tool: the creator tool (XMP is preferred over the standard metadata dictionary if both exist)
- pdf_producer: the producer
- pdf_version: the PDF version as identified by PDFBox
- pdfa_version: the PDF/A version if this file identifies as a PDF/A
- pdfuaid_part: the PDF/UA id part if the file identifies as a PDF/UA
- pdfx_conformance: the PDF/X conformance if the file identifies as a PDF/X
- pdfx_version: the PDF/X version if the file identifies as PDF/X
- pdfxid_version: the PDF/X id if the file identifies as PDF/X
- pdfvt_version: the PDF/VT version if the file identifies as PDF/VT
- pdf_num_3d_annotations: the number of 3D annotations in the container file
- pdf_has_acroform_fields: whether the PDF has AcroForm fields
- pdf_incremental_updates: the number of incremental updates as counted by Apache Tika’s rough heuristic of scanning for startxref and %%EOF
- pdf_overall_unmapped_unicode_chars: the percentage of characters extracted from the PDF that do not have Unicode mappings
- pdf_contains_damaged_font: whether PDFBox identifies a damaged font
- pdf_contains_non_embedded_font: whether PDFBox identifies a non-embedded font
- has_signature: whether the file has a digital signature. This can be true of PDFs and MSOffice files.
- location: latitude,longitude when extracted from the metadata of a file (e.g. EXIF metadata); applies to embedded files, not as much to container files that are PDFs
- tika_eval_num_tokens: the number of tokens (words) that were counted in the extracted text by the tika-eval module
- tika_eval_num_alpha_tokens: the number of alphabetic tokens (words) that were counted in the extracted text by the tika-eval module
- tika_eval_lang: the language as identified by tika-eval’s language detector on the extracted text (statistical classifier based on character frequencies)
- tika_eval_oov: the out-of-vocabulary statistic as calculated by tika-eval. After running language identification on the extracted text, the tika-eval module counts how many words in the extracted text were in the top 20k most common words for the identified language; the statistic reflects the share that were not. When there are enough tokens (> 100) and this value is high, that may indicate that the extracted text is garbled.
- container_exception: the stacktrace if there was a parse exception on the file
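
The rule of thumb for tika_eval_oov can be written as a simple filter. The following is a sketch only: the 100-token minimum comes from the description above, but the 0.5 cutoff for a "high" value is a hypothetical choice, and the code assumes tika_eval_oov is stored as a fraction between 0 and 1 and that tika_eval_num_tokens is the relevant token count. The function name may_be_garbled is also hypothetical.

```python
def may_be_garbled(num_tokens, oov, min_tokens=100, oov_threshold=0.5):
    """Flag extracted text that might be garbled.

    num_tokens: value of tika_eval_num_tokens for the row.
    oov: value of tika_eval_oov, assumed here to be a fraction in [0, 1].
    min_tokens: minimum token count for the statistic to be meaningful.
    oov_threshold: hypothetical cutoff for a "high" out-of-vocabulary value.
    """
    return num_tokens > min_tokens and oov > oov_threshold
```

Values read from the CSV arrive as strings, so they need converting with int() and float() first, and rows with empty cells should be skipped. The threshold is best tuned by inspecting a sample of flagged files.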
Apache Tika metadata: Container file with Attachments
tika-with-attachments-20230714.csv.gz
- attachment_num: if the file is an attachment, this is the attachment number within the primary container file
- emb_depth: the embedded depth of the attachment (if an attachment)
- embedded_id: a unique id for the embedded file
- embedded_id_path: the path based on the unique ids for the embedded file. For example, if a PDF has an attached MSG file (id=1) that itself has an attached DOCX file (id=2), then the path for the DOCX file would be /1/2
- embedded_resource_type: whether this was an ATTACHMENT or a MACRO
- embedded_exception: the stacktrace on the embedded file if there was a catchable parse exception thrown during the processing of the embedded file
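
The embedded_id_path value can be split into its component ids to recover an attachment's chain of parents. The following is a minimal sketch, assuming the ids are integers as in the /1/2 example above; the function names are hypothetical.

```python
def parse_embedded_id_path(path):
    """Split an embedded_id_path such as "/1/2" into a list of integer ids.

    Assumption: every path segment is an integer id, as in the example above.
    An empty path yields an empty list.
    """
    return [int(part) for part in path.split("/") if part]


def parent_embedded_id(path):
    """Return the id of the direct parent attachment.

    Returns None when the file sits directly inside the container file
    or when the path is empty.
    """
    ids = parse_embedded_id_path(path)
    return ids[-2] if len(ids) >= 2 else None
```

For the DOCX file in the example, `parse_embedded_id_path("/1/2")` gives `[1, 2]` and `parent_embedded_id("/1/2")` gives `1`, the id of the MSG file. Whether the length of this list always equals emb_depth is not stated above and would need to be checked against the data.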