Crawl provenance metadata
The table `cc-provenance-20230303.csv.gz` contains all provenance information from the crawl (8,410,704 rows, including the header). The columns include:
- `url_id`: primary key for each URL fetched or refetched
- `file_name`: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- `url`: target URL extracted from Common Crawl’s index files. The maximum length in this set is 6,771 characters.
- `cc_digest`: digest calculated by Common Crawl and extracted from the index files
- `cc_http_mime`: MIME type as extracted from Common Crawl’s index files; this is derived from the HTTP header
- `cc_detected_mime`: the detected MIME type, as extracted from Common Crawl’s index files
- `cc_warc_file_name`: the Common Crawl WARC file where the file’s individual WARC file is stored
- `cc_warc_start`: the offset within the `cc_warc_file` where the individual WARC file is stored
- `cc_warc_end`: the end of the individual WARC file within the larger `cc_warc_file`
- `cc_truncated`: Common Crawl’s code for why the file was truncated, if it was truncated. This information was extracted from Common Crawl’s indices. Values include:
  - `''` (6,383,873): empty string; Common Crawl records this as not truncated
  - `length` (2,020,913): the file was truncated because of length
  - `disconnect` (5,861): there was a network disconnection during Common Crawl’s original fetch
  - `time` (56): there was a timeout during Common Crawl’s original fetch
- `fetched_status`: records our project’s status for obtaining the file. Values include:
  - `ADDED_TO_REPOSITORY` (6,377,619): extracted directly from the Common Crawl data
  - `REFETCHED_SUCCESS` (1,922,505): our project refetched content from the original target URL
  - `REFETCH_UNHAPPY_HOST` (53,038): we tried to refetch a URL, but the failures from that host exceeded our threshold. (We didn’t want to bother a host that had refused our refetches.)
  - `REFETCHED_IO_EXCEPTION_READING_ENTITY` (45,561): during our refetch, there was an IOException while trying to read the contents
  - `EMPTY_PAYLOAD` (5,719): there was an empty payload in the Common Crawl WARC file
  - `REFETCHED_TIMEOUT` (5,157): timeout during our attempted refetch
  - `REFETCHED_IO_EXCEPTION` (569): general IOException while we were trying to refetch
  - `null` (506): ??
  - `FETCHED_EXCEPTION_EMITTING` (29): there was an exception when we tried to write a refetched PDF to S3
- `fetched_digest`: the sha256 that we calculated on the bytes that we have for the file, whether fetched from Common Crawl or refetched
- `fetched_length`: the length in bytes of the file that we extracted from Common Crawl or refetched
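
The three `cc_warc_*` columns together locate one record inside a larger Common Crawl WARC file, so they could be turned into an HTTP byte-range request. The following is a minimal sketch under stated assumptions, not the project's own tooling: the base URL `https://data.commoncrawl.org/`, the helper name `warc_range_request`, and the sample row are all illustrative, and whether `cc_warc_end` is an inclusive or exclusive offset is not stated in the column description, so the helper exposes that choice as a parameter.

```python
from typing import Dict, Tuple

# Assumption: Common Crawl's public HTTP endpoint. The table does not state it.
CC_BASE_URL = "https://data.commoncrawl.org/"


def warc_range_request(
    row: Dict[str, str], end_inclusive: bool = True
) -> Tuple[str, Dict[str, str]]:
    """Build the URL and Range header for one provenance row.

    HTTP byte ranges are inclusive at both ends. If cc_warc_end turns out to be
    an exclusive offset, pass end_inclusive=False to subtract one byte.
    """
    start = int(row["cc_warc_start"])
    end = int(row["cc_warc_end"])
    if end < start:
        raise ValueError(f"cc_warc_end ({end}) precedes cc_warc_start ({start})")
    last_byte = end if end_inclusive else end - 1
    url = CC_BASE_URL + row["cc_warc_file_name"].lstrip("/")
    return url, {"Range": f"bytes={start}-{last_byte}"}


# Hypothetical row: the file name and offsets are made up for illustration.
sample_row = {
    "cc_warc_file_name": "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz",
    "cc_warc_start": "100",
    "cc_warc_end": "199",
}
url, headers = warc_range_request(sample_row)
print(url)
print(headers)
```

The returned URL and headers can be handed to any HTTP client. Checking the first retrieved record against `cc_digest` or `fetched_length` would be one way to confirm which interpretation of `cc_warc_end` applies.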
mime | count |
---|---|
application/pdf | 8,156,384 |
application/octet-stream | 145,722 |
text/html | 22,901 |
application/download | 14,011 |
application/force-download | 12,740 |
unk | 11,460 |
content-type: | 7,153 |
(empty) | 7,114 |
application/x-download | 6,078 |
binary/octet-stream | 2,166 |

The table above lists `cc_http_mime` values.

mime | count |
---|---|
application/pdf | 8,389,207 |
text/html | 16,515 |
text/plain | 3,049 |
application/xhtml+xml | 814 |
application/pkcs7-signature | 210 |
application/x-tika-ooxml | 142 |
image/jpeg | 117 |
application/xml | 96 |
application/octet-stream | 78 |
application/gzip | 76 |

The table above lists `cc_detected_mime` values.

Hosts provenance metadata
The table `cc-hosts-20230303.csv.gz` contains information about the hosts and, where possible, the geographic location of the host for each PDF (8,410,704 rows, including the header). The columns include:
- `url_id`: primary key for each URL fetched or refetched. This key can be joined with the `url_id` in the `cc-provenance-20230303.csv.gz` table.
- `file_name`: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- `host`: host
- `tld`: top-level domain
- `ip_address`: as retrieved from Common Crawl or captured during refetch
- `country`, `latitude` and `longitude`: as geolocated by MaxMind’s GeoLite2
Among the 8.3 million URLs for which we have a file, the top 10 countries by count are:
Country Code | Count |
---|---|
US | 3,259,209 |
DE | 896,990 |
FR | 462,215 |
JP | 364,303 |
GB | 268,950 |
IT | 228,065 |
NL | 206,389 |
RU | 176,947 |
CA | 175,853 |
ES | 173,619 |
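
Counts like these could be recomputed by joining the hosts table to the provenance table on `url_id`. The sketch below is illustrative only. It assumes that "URLs for which we have a file" means rows whose `fetched_status` is `ADDED_TO_REPOSITORY` or `REFETCHED_SUCCESS` (their counts sum to roughly 8.3 million), and it runs on tiny made-up CSV samples rather than the real files. The function name `country_counts` is hypothetical.

```python
import csv
import io
from collections import Counter

# Assumption: these two statuses identify the rows "for which we have a file".
HAVE_FILE = {"ADDED_TO_REPOSITORY", "REFETCHED_SUCCESS"}


def country_counts(provenance_rows, host_rows) -> Counter:
    """Count countries over hosts rows whose url_id has a file in provenance."""
    have_file_ids = {
        row["url_id"]
        for row in provenance_rows
        if row["fetched_status"] in HAVE_FILE
    }
    counts = Counter()
    for row in host_rows:
        if row["url_id"] in have_file_ids:
            counts[row["country"]] += 1
    return counts


# Made-up sample data carrying only the columns this sketch needs.
provenance_csv = """url_id,fetched_status
1,ADDED_TO_REPOSITORY
2,REFETCHED_SUCCESS
3,REFETCHED_TIMEOUT
"""
hosts_csv = """url_id,country
1,US
2,DE
3,US
"""

counts = country_counts(
    csv.DictReader(io.StringIO(provenance_csv)),
    csv.DictReader(io.StringIO(hosts_csv)),
)
print(counts.most_common(10))
```

Holding the matching `url_id` values in a set keeps the join to one pass over each table. For the full tables, that set would contain several million strings, so a database or dataframe join may be a better fit.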
pdfinfo utility metadata
The table `pdfinfo-20230315.csv.gz` contains output from `pdfinfo` (poppler version=23.03.0, data version=0.4.12). We ran this in a Docker container based on `debian:bullseye-20230227-slim` with the `-isodates` flag and a timeout of 2 minutes. The columns include:
- `url_id`: primary key for each URL fetched or refetched. This key can be joined with the `url_id` in the `cc-provenance-20230303.csv.gz` table.
- `file_name`: name of the PDF file as our project named it inside the zip. This value is not unique in this table because a given PDF (as identified by its sha256) may have been fetched from multiple URLs.
- `parse_time_millis`: milliseconds to process the file
- `exit_value`: exit value for the `pdfinfo` process
- `timeout`: boolean for whether or not the process timed out (`exit_value` = -1 in the 2 records where this happens)
- `stderr`: stderr stream from `pdfinfo` (limited to first 1,024 characters)
- `pdf_version`: PDF version from the header comment line at the start of the PDF file
- `creator`: PDF creator tool from the Document Information dictionary (limited to first 1,024 characters)
- `producer`: PDF producer from the Document Information dictionary (limited to first 1,024 characters)
- `created`: date created from the Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
- `modified`: date modified from the Document Information dictionary in ISO-8601 format (format: 2021-06-11T17:42:51+08 or 2021-07-31T19:31:14Z)
- `custom_metadata`: whether or not there is custom metadata (non-standard keys in the Document Information dictionary)
- `metadata_stream`: whether or not there is an XMP Metadata stream (Document Catalog Metadata key)
- `tagged`: whether the PDF is a Tagged PDF (Mark Information dictionary Marked key)
- `user_properties`: whether the PDF contains user properties (Mark Information dictionary UserProperties key)
- `form`: whether the PDF is a form: `XFA`, `AcroForm`, `null` or `''` (empty)
- `javascript`: whether the PDF contains JavaScript (ECMAScript)
- `pages`: number of pages according to the PDF page tree
- `page_size`: string representing the page size of the first page (in points, 1/72 inch)
- `page_rotation`: the page rotation of the first page (raw, as specified by the Rotate key)
- `optimized`: whether the PDF file is linearized (a.k.a. “Fast web view” enabled)
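
The `created` and `modified` values use two offset spellings, an hours-only offset such as `+08` and a trailing `Z`, which older Python versions do not accept in `datetime.fromisoformat`. The sketch below normalizes both before parsing. It is illustrative only: the helper name `parse_pdfinfo_date` is hypothetical, and treating empty or unparseable values as `None` is an assumption about how one might want to handle dates taken from arbitrary PDFs.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches a time followed by an hours-only offset at the end, e.g. "T17:42:51+08".
_HOURS_ONLY_OFFSET = re.compile(r"(T\d{2}:\d{2}:\d{2})([+-]\d{2})$")


def parse_pdfinfo_date(value: str) -> Optional[datetime]:
    """Parse a created/modified value; return None if empty or unparseable."""
    value = value.strip()
    if not value:
        return None
    if value.endswith("Z"):
        # "Z" means UTC; spell it as an explicit offset.
        value = value[:-1] + "+00:00"
    else:
        # Expand "+08" to "+08:00" so the offset has hours and minutes.
        value = _HOURS_ONLY_OFFSET.sub(r"\1\2:00", value)
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


# The two example formats quoted in the column descriptions.
created = parse_pdfinfo_date("2021-06-11T17:42:51+08")
modified = parse_pdfinfo_date("2021-07-31T19:31:14Z")
print(created.astimezone(timezone.utc).isoformat())
print(modified.isoformat())
```

Converting to UTC, as in the first `print`, gives values that can be compared across files regardless of the offset recorded in each PDF.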
Exit Value | Count | Notes |
---|---|---|
0 | 7,893,956 | Completed normally |
1 | 37,692 | May not be a PDF file (21,837), Encrypted file (4,295), other problem |
99 | 1,185 | Wrong page range given (1,095); typically the page tree has 0 pages?! |
-1 | 2 | timeout |
null | 1 | 0 byte file |

The table above lists `pdfinfo` exit values.
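
A tally of `exit_value` could be recomputed by streaming the gzipped CSV rather than loading it whole. This is a minimal sketch, assuming the file is UTF-8 encoded and that its header row uses the column names listed above. The function name `exit_value_counts` is hypothetical, and the sample bytes are made up so the example runs without the real file.

```python
import csv
import gzip
import io
from collections import Counter


def exit_value_counts(binary_stream) -> Counter:
    """Stream a gzipped CSV and count each distinct exit_value string."""
    counts = Counter()
    # newline="" lets the csv module handle line endings inside quoted fields.
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        for row in csv.DictReader(text):
            counts[row["exit_value"]] += 1
    return counts


# Made-up sample with only a few of the real columns.
sample = gzip.compress(
    b"url_id,exit_value,timeout\n"
    b"1,0,false\n"
    b"2,0,false\n"
    b"3,1,false\n"
    b"4,-1,true\n"
)
counts = exit_value_counts(io.BytesIO(sample))
print(counts)
```

For the real table, the same function could be given a file opened in binary mode, for example `open("pdfinfo-20230315.csv.gz", "rb")`. Values are counted as strings, so how the file spells a missing exit value (for instance an empty field or the text `null`) would show up as its own key.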