This corpus contains nearly 8 million PDFs gathered from across the web in July/August of 2021. The PDF files were initially identified by Common Crawl as part of their July/August 2021 crawl (identified as CC-MAIN-2021-31) and subsequently updated and collated as part of the DARPA SafeDocs program.

The corpus can be downloaded using HTTPS or accessed directly using Amazon’s AWS S3 protocol:

HTTPS download
S3 Prefix s3://digitalcorpora/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/

This corpus offers five benefits over the Common Crawl datasets as stored in Amazon Public Datasets:

  1. Common Crawl truncates files at 1MB. For this corpus, we refetched the complete/untruncated PDF files from the original URLs without any file size limitation.
  2. This corpus offers a tractable subset of the files, focusing on a single format: PDF.
  3. We have supplemented the metadata to include geo-ip-location (where possible) and other metadata extracted from the PDF files (e.g. by pdfinfo).
  4. All PDF files (both the <1MB PDFs served intact by Common Crawl and the larger files that Common Crawl had truncated, which we refetched in full) are conveniently packaged in the ZIP format, following the same convention as GovDocs1.
  5. At the time of its creation, this is the largest single corpus of real-world (extant) PDFs that is publicly available. Many other smaller, targeted or synthetic PDF-centric corpora exist.

It is not possible to rigorously assess how representative this corpus is of all PDF files on the web, or of PDF files in general. A significant number of PDF files reside in private intranets or repositories, sit behind logins, or are withheld from public access due to PII or other confidential content. As a result, no corpus created by web crawling can be assumed to cover every PDF feature or capability. Even among web crawls, preliminary analysis suggests that Common Crawl data is best viewed as a convenience sample. In short, the crawls (and this corpus) are neither fully representative nor complete, but they do offer a large set of data from the publicly accessible web.

For the specific CC-MAIN-2021-31 crawl, the Common Crawl project writes:

The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content. It includes page captures of 1 billion new URLs, not visited in any of our prior crawls.

We could not have done this work without the initial Common Crawl data. Please note Common Crawl’s license and terms of use.


PDF is a ubiquitous format used across many industrial and research domains. Many existing corpora of extant data (such as GovDocs1) are now quite old and no longer reflect current trends in either PDF itself (as a file format) or in PDF-creating and authoring applications. Advances in machine learning have also greatly increased the demand for large data sets. This corpus is thus helpful for:

  • PDF technology and software testing, assessment, and evaluation
  • Information privacy research
  • Document understanding, text extraction, table identification, OCR/ICR, formula identification, document recognition and analysis, and related document engineering domains
  • Malware and cyber-security research
  • ML/AI applications and research (document classification, document content, text extraction, etc.)
  • Preservation and archival research
  • Usability and accessibility research
  • Software engineering research (parsing, formal methods, etc.)


All PDF files are named using a sequential 7-digit number with a .pdf extension (e.g. 0000000.pdf, 0000001.pdf through 7932877.pdf). The file number carries no meaning of its own: the ordering is derived from the SHA-256 hash of each PDF. Duplicate PDF files (based on that SHA-256 hash) have been removed – there are 8.3 million URLs for which we have a PDF file, and 7.9 million unique PDF files.
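The corpus-construction pipeline itself is documented elsewhere, but the SHA-256-based deduplication described above can be sketched in Python as follows (the function and variable names are illustrative, not from the actual pipeline):

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(pdf_blobs):
    """Collapse PDFs fetched from multiple URLs to one copy per payload.

    `pdf_blobs` is an iterable of (url, bytes) pairs. Two URLs that
    served byte-identical PDFs map to the same SHA-256 digest, so only
    the first (url, bytes) pair for each digest is kept.
    """
    seen = {}
    for url, blob in pdf_blobs:
        digest = sha256_hex(blob)
        # setdefault keeps the first URL seen for this content.
        seen.setdefault(digest, (url, blob))
    return seen
```

In the real corpus, this kind of content-hash deduplication is what collapses 8.3 million fetched URLs into 7.9 million unique PDF files.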

PDF files are then packaged into ZIP files based on their sequentially numbered filenames, with each ZIP file containing up to 1,000 PDF files (fewer where duplicates were detected and removed). The resulting ZIP files range in size from just under 1.0 GB to about 2.8 GB. With a few exceptions, all of the 7,933 ZIP files in the zipfiles/ sub-directory tree contain 1,000 PDF files (see the Errata section on the “Constructing the CC-MAIN-2021-31-PDF-UNTRUNCATED corpus” page).

Each ZIP is named using a sequential 4-digit number representing the high 4 digits of the 7-digit PDF files it contains – so 0000.zip contains all PDFs numbered from 0000000.pdf to 0000999.pdf; 0001.zip contains PDFs numbered from 0001000.pdf to 0001999.pdf; etc. ZIP files are clustered into groups of 1,000 and stored in sub-directories below zipfiles/ based on the 4-digit ZIP filename, with each sub-directory limited to 1,000 ZIP files: zipfiles/0000-0999/, zipfiles/1000-1999/, etc.
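Given this layout, the location of any PDF can be computed from its 7-digit number alone. A minimal sketch, assuming ZIP files carry a .zip extension matching their 4-digit number:

```python
# S3 prefix of the corpus, as given above.
S3_PREFIX = "s3://digitalcorpora/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/"


def zip_path_for(pdf_name: str) -> str:
    """Map a PDF filename like '0001234.pdf' to its ZIP's corpus-relative path.

    The high 4 digits of the PDF number name the ZIP; ZIPs are grouped
    1,000 per sub-directory under zipfiles/ (0000-0999, 1000-1999, ...).
    """
    number = int(pdf_name.split(".")[0])       # '1234567.pdf' -> 1234567
    zip_stem = f"{number // 1000:04d}"         # high 4 digits -> '1234'
    group_lo = (number // 1_000_000) * 1000    # start of the ZIP group -> 1000
    subdir = f"{group_lo:04d}-{group_lo + 999:04d}"
    return f"zipfiles/{subdir}/{zip_stem}.zip"
```

For example, `S3_PREFIX + zip_path_for("0000374.pdf")` yields the full S3 key of the ZIP holding 0000374.pdf.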

The entire corpus when uncompressed takes up nearly 8 TB.

Supplementary Metadata

We include tables that link each PDF file back to its original Common Crawl record in the CC-MAIN-2021-31 dataset and that offer a richer view of the data via extracted metadata. These are placed in the metadata/ sub-directory. Four tables of metadata are provided:

  1. Crawl provenance metadata
  2. Hosts provenance metadata
  3. pdfinfo utility metadata
  4. Apache Tika metadata

For each table, we include the full table as a gzipped, UTF-8 encoded, CSV (e.g. cc-provenance-20230303.csv.gz). Detailed information about each table can be found on the CC-MAIN-2021-31-PDF-UNTRUNCATED Metadata page.
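Because the full tables are plain gzipped, UTF-8 encoded CSV, they can be streamed with nothing beyond Python's standard library. A sketch (the filename is the example one above; the columns depend on the table):

```python
import csv
import gzip


def iter_metadata_rows(path: str):
    """Yield each row of a gzipped, UTF-8 encoded metadata CSV as a dict."""
    # Mode 'rt' decompresses and decodes on the fly; newline='' lets the
    # csv module handle quoted newlines inside fields correctly.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

Usage would look like `for row in iter_metadata_rows("cc-provenance-20230303.csv.gz"): ...`, processing the table row by row without ever holding the decompressed file in memory.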

We also include an uncompressed copy of each metadata table containing only the rows for the first 1,000 PDF files, so that users may easily familiarize themselves with a smaller portion of the data (e.g. cc-provenance-20230324-1k.csv). Note that there are 1,045 data rows in these *-1k.csv tables because the tables are URL-based – the same PDF may have come from multiple URLs. For example, 0000374.pdf was retrieved from five URLs, so it appears five times in these tables.
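Since the tables are URL-keyed, the same PDF appears once per source URL. Counting the URLs behind each file is a one-liner over the parsed rows; a hedged sketch, where the column name is an assumption and not the table's documented schema:

```python
from collections import Counter
from typing import Iterable


def urls_per_pdf(rows: Iterable[dict],
                 filename_column: str = "pdf_filename") -> Counter:
    """Count how many URL rows reference each PDF file.

    `filename_column` is a hypothetical column name; consult the
    CC-MAIN-2021-31-PDF-UNTRUNCATED Metadata page for the real schema.
    """
    return Counter(row[filename_column] for row in rows)
```

Applied to a *-1k.csv table, a file like 0000374.pdf would show a count of five, matching its five source URLs.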

Further note that, because the metadata contains non-ASCII Unicode, the *-1k.csv tables have a UTF-8 Byte Order Mark (BOM) prepended so that spreadsheet applications (such as Microsoft Excel) can open them by double-clicking without producing mojibake. Such applications do not prompt for an encoding when opening CSV files directly; the prompts for delimiters and encoding appear only when manually importing the data.

The very large gzipped metadata CSV files for the entire corpus do not have UTF-8 BOMs added as these are not directly usable by office applications.
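In Python terms, the only difference between the two variants is the codec: writing with "utf-8-sig" prepends the BOM that Excel looks for, while plain "utf-8" (as in the large gzipped tables) does not. A minimal sketch:

```python
import csv
import io

BOM = b"\xef\xbb\xbf"  # the UTF-8 byte order mark


def write_csv_for_excel(rows, header) -> bytes:
    """Serialize rows as UTF-8 CSV with a leading BOM, Excel-friendly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # 'utf-8-sig' prepends the BOM; plain 'utf-8' would omit it.
    return buffer.getvalue().encode("utf-8-sig")
```

Reading such a file back with encoding "utf-8-sig" strips the BOM transparently, so round-tripping is lossless.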

The dates in the file names denote when the metadata file was created.

Related Work

The background on the construction of this corpus is documented separately.

  • Allison, Timothy. “Making more sense of PDF structures in the wild at scale.” PDF Days Europe 2022, September 12-13, 2022. Video and slide deck.
  • Allison, Timothy. “Building a File Observatory: Making sense of PDFs in the Wild.” Open Preservation Foundation Webinar, January 19, 2022. Slide deck.
  • Allison, Timothy. “Making sense of PDF structures in the wild at scale.” PDF Days Online 2021, September 29, 2021. Video and slide deck.
  • Allison, Timothy; Burke, Wayne; Mattmann, Chris; Menshikova, Anastasia; Southam, Philip; Stonebraker, Ryan; and Timmaraju, Virisha. “Building a Wide Reach Corpus for Secure Parser Development.” IEEE Security & Privacy LangSec Workshop, May 21, 2020. Slides and paper.


This dataset was gathered by a team at NASA’s Jet Propulsion Laboratory (JPL), California Institute of Technology, while supporting the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. The JPL team included Chris Mattmann (PI), Wayne Burke, Dustin Graf, Tim Allison, Ryan Stonebraker, Mike Milano, Philip Southam and Anastasia Menshikova.

The JPL team collaborated with Peter Wyatt, the Chief Technology Officer of the PDF Association and PI on the SafeDocs program, in the design and documentation of this corpus.

The JPL team and the PDF Association would like to thank Simson Garfinkel and Digital Corpora for taking ownership of this dataset and publishing it. Our thanks are extended to the Amazon Open Data Sponsorship Program for enabling this large corpus to be free and publicly available as part of the Digital Corpora initiative.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

The research was carried out at the NASA (National Aeronautics and Space Administration) Jet Propulsion Laboratory, California Institute of Technology under a contract with the Defense Advanced Research Projects Agency (DARPA) SafeDocs program. Government sponsorship acknowledged.