Constructing the CC-MAIN-2021-31-PDF-UNTRUNCATED corpus

Types of Common Crawl Data used

This project used two types of data from Common Crawl. For more information on the types of data available for each crawl, see Common Crawl’s Getting Started page.

Common Crawl indices

The indices are gzipped text files. Each line contains a key that enables easy sorting of URLs by host and domain, a timestamp, and a JSON object that holds metadata about the URL. Information in the JSON object includes, among other things, the URL, the MIME type, the detected MIME type, and the location of the individual WARC file, given as the path to the compound WARC plus the offset and length of the individual WARC file within that compound WARC. See commoncrawl-fetcher-lite for more details.
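As a rough illustration of the line layout described above, the following sketch splits one index line into its key, timestamp and JSON metadata. The sample record is made up, and the JSON field names used here (url, mime, mime-detected, filename, offset, length) are assumptions for the example rather than a specification of the index format.

```python
import json


def parse_index_line(line):
    """Split one index line into (sort key, timestamp, metadata dict)."""
    # Assumes the key and the timestamp contain no spaces, so splitting
    # twice on a space leaves the whole JSON object in the third part.
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(payload)


# Made-up record for illustration only; not a real entry from the crawl.
sample = (
    'org,example)/docs/report.pdf 20210731011529 '
    '{"url": "https://example.org/docs/report.pdf", '
    '"mime": "application/pdf", "mime-detected": "application/pdf", '
    '"filename": "crawl-data/example.warc.gz", '
    '"offset": "1000", "length": "250"}'
)

key, timestamp, meta = parse_index_line(sample)
print(key, timestamp, meta["url"], meta["filename"], meta["offset"], meta["length"])
```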

Common Crawl WARCs

Common Crawl concatenates gzipped WARCs into very large WARC files. To fetch an individual file’s original WARC, users need to know the source WARC file, the offset for the individual file and the length. See below for a worked example.
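A minimal sketch of how the offset and length translate into an HTTP byte range. Byte ranges are inclusive at both ends, so the last byte is offset + length - 1. The function names and the numbers are hypothetical.

```python
def byte_range(offset, length):
    """Return the inclusive (first, last) byte positions of one record."""
    return offset, offset + length - 1


def range_header(offset, length):
    """Format the value of an HTTP Range header for one record."""
    first, last = byte_range(offset, length)
    return f"bytes={first}-{last}"


# Hypothetical values: a record of 250 bytes starting at byte 1000.
print(range_header(1000, 250))
```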

File Types

Our team processed the indices for this crawl and extracted all files where an HTTP Content-Type header contained the letters pdf or where Common Crawl’s automatic file detection detected a PDF. We acknowledge that this choice means the corpus includes some files that are not actually PDFs.
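The selection rule above can be sketched as a simple predicate. This is an illustration, not the project's actual code: the case-insensitive comparison and the use of application/pdf as the detected type are assumptions.

```python
PDF_MIME = "application/pdf"  # assumed value reported by file detection


def looks_like_pdf(content_type, detected_mime):
    """True if the Content-Type mentions pdf or detection reports a PDF."""
    header_says_pdf = "pdf" in (content_type or "").lower()
    detected_pdf = (detected_mime or "").lower() == PDF_MIME
    return header_says_pdf or detected_pdf


# The header alone is enough, even when detection disagrees, which is
# one way files that are not actually PDFs can end up selected.
print(looks_like_pdf("application/pdf; charset=utf-8", "text/html"))
print(looks_like_pdf("application/octet-stream", "application/pdf"))
print(looks_like_pdf("text/html", "text/html"))
```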

Common Crawl or Refetched

In the indices for a crawl, Common Crawl has a flag for whether or not the file was truncated. We extracted roughly 6 million files directly from Common Crawl. We then refetched from the original URLs nearly 2 million files that Common Crawl had identified as truncated.
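A small sketch of how index records might be divided into the two groups. It assumes each record is a dict and that the truncation flag appears as a non-empty "truncated" field; both the field name and the sample records are assumptions made for the example.

```python
def split_by_truncation(records):
    """Return (intact, truncated) lists of index records."""
    intact, truncated = [], []
    for record in records:
        # Assumption: a missing or empty "truncated" field means intact.
        if record.get("truncated"):
            truncated.append(record)
        else:
            intact.append(record)
    return intact, truncated


# Made-up records for illustration only.
records = [
    {"url": "https://example.org/a.pdf"},
    {"url": "https://example.org/b.pdf", "truncated": "length"},
]
intact, truncated = split_by_truncation(records)
print(len(intact), "to extract from Common Crawl,", len(truncated), "to refetch")
```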

Filenaming

We sorted the files by SHA-256 and then numbered them from 0 (0000000.pdf) to roughly 8 million (7932877.pdf). We added a .pdf file extension to every file.
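The numbering scheme can be sketched as follows. The seven-digit zero padding is inferred from the 0000000.pdf example, the digests are assumed to be unique lowercase hex strings, and the sample contents are made up.

```python
import hashlib


def assign_file_names(digests):
    """Map each SHA-256 hex digest to a zero-padded, numbered .pdf name."""
    # Sorting the hex strings gives the order in which numbers are assigned.
    return {
        digest: f"{index:07d}.pdf"
        for index, digest in enumerate(sorted(digests))
    }


# Made-up file contents standing in for real PDFs.
contents = [b"first file", b"second file", b"third file"]
digests = [hashlib.sha256(data).hexdigest() for data in contents]
names = assign_file_names(digests)
for digest in sorted(names):
    print(names[digest], digest)
```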

Errata

We are aware that the following PDFs are missing from the corpus. These omissions were caused by sporadic S3 write exceptions during the fetching and refetching.

File name     sha256
177150.pdf    05ba53532b7bfc15901bc1bd3371421be758bb08cc2070528a49be4c0b77c6c7
594742.pdf    1334239e569fad2a30d11f6f90d5f75645ded13870cd9b6118b4930d297a23e9
706328.pdf    16cd8100c6a8710d5c404ee11bfc285efee5693c6ceaa42fce2b466051b2c40a
1260258.pdf   28a410c2b3a767d618b44980be1a68335fd436e70165211d03421fcd198e4de7
1544119.pdf   31ca2adee5ea5ac522bf02db2a9a70bdc0e220ccc242dce9b22254e9a3f7c8fa
1591732.pdf   3354af25e39f6ccfabb7833f14958512537dd019e9d4dddeb912fb5b5799158b
1640603.pdf   34eb229ecac8ddecf1632a06762a1998477c07d56249db84edfd157245b6022c
1890087.pdf   3cf45e3dc0fdf429ac894d77ea85460db744dd93c8704102b914974e7b963630
1920911.pdf   3df2586c61b34ad857b4f13eebf1bf2fd8f1a9af71c582c26640278166ba1f7f
1992331.pdf   403f27afa6c84a5fbc512361d9929ef49ae00d399f1b1f876c26a900d056a846
2519839.pdf   51467cf4516df4919c3b195ad67c10a668d339a705c4644ce60fd69f39f6730e
2712444.pdf   577c5f029ff827362b5a71d14f1e4a015bea3eb53960e250ffa1dde2f7ae0050
2765539.pdf   59343aae861d86d9d360b4ccf0183f33a77e49b67696ee1f900821e7dad1f04e
3179469.pdf   669693d161926d705d63ac8fed895857549b4b7e5d82c2ead56a07c367616fb5
4170238.pdf   86931ce5974bff673eb48aa4159b6c215efea4ce636f8e486e9fd54c14e33e9f
4414331.pdf   8e77a888f6ac85d24ac63e55810c2b2646ba18f540037ac748b50007f7c1c8c8
4512373.pdf   91a3d6390adceb54e0ff993f8cfd58250f1bbabfd5ef061a7659ed019897d179
4977579.pdf   a09de5d289dd95d4b4b71d13e196e05db5ab5d228c65afcd74e5900a40a11b09
5198714.pdf   a7c81076098d7e179d13ab60a8da6c8897f71315060b73b959667e0f8ff385b9
5236677.pdf   a9031fc3fbaecf9abcb906e630fdeb90e71e1f9e3d78959ff5101e0fdaa7de65
5447694.pdf   afd19ad6ca780aa7c90756e97aa20fb11bf4781cfc0ee00e5bf23f66f940f51a
6318895.pdf   cbeb29136aaa7b934c2b8616dafa4b8b9213235ed1be9c818c3858c990914275
6817632.pdf   dc0840305e174825fa1471dc2ab463bdffece4ec78b496b5e6a65245f4df4cc1
6940914.pdf   e004f3c7cf38f24ed278b9d3c30c5269f625ed66c623bd6f46ecb3aed9dac3d4
7241425.pdf   e9b4ec5975d197ffc9d199a188d68cc75cb323ecca545df0668c567bc04a769a
7279847.pdf   eaf2d8ba2606262e861d5e8fe0b26b9c456d1fd3290c17d7c115dc14e02a73ca
7407159.pdf   ef107d1cd9224d3582a1364b012f1585a6192ef1fa3267ab18c078777083091f
7635694.pdf   f670fc79401a83b67f2695666803fb8e2ef2fe05a20c2880ea9f0b7465431523
7889525.pdf   fe9b31aa4fcf115ae893ffb2937558a11ee7c80ed9dd1908c3a9451ae8d3c140

Extras

1. How to extract an individual WARC from Common Crawl

First, users need the cc_warc_file, cc_warc_start and cc_warc_end values from the provenance table. We’ll use curl and gunzip. Let’s say we want to pull 0000000.pdf, which comes from crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz, starts at offset 3,724,499 and ends at offset 3,742,341 (inclusive).

  1. Prepend https://data.commoncrawl.org/ to the cc_warc_file to get the URL.
  2. The HTTP range will be: 3724499-3742341
  3. Fetch the gzipped WARC file: curl -r 3724499-3742341 https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz -o 0000000.warc.gz
  4. Decompress the result: gunzip 0000000.warc.gz
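For readers who prefer a script to curl and gunzip, the same four steps can be sketched in Python using only the standard library. This is a minimal sketch, not part of the project's tooling: the function names are hypothetical, and fetch_warc_record is defined but not called here because it needs network access.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def build_request(cc_warc_file, cc_warc_start, cc_warc_end):
    """Steps 1 and 2: build the URL and the inclusive HTTP byte range."""
    url = BASE_URL + cc_warc_file
    headers = {"Range": f"bytes={cc_warc_start}-{cc_warc_end}"}
    return urllib.request.Request(url, headers=headers)


def fetch_warc_record(cc_warc_file, cc_warc_start, cc_warc_end):
    """Steps 3 and 4: fetch the gzipped record and decompress it."""
    request = build_request(cc_warc_file, cc_warc_start, cc_warc_end)
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())


# Values from the worked example for 0000000.pdf.
request = build_request(
    "crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/"
    "CC-MAIN-20210731011529-20210731041529-00143.warc.gz",
    3724499,
    3742341,
)
print(request.full_url)
print(request.get_header("Range"))
```

Calling fetch_warc_record with the same three arguments should return the decompressed bytes of the individual WARC, which could then be written to a file such as 0000000.warc.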