Types of Common Crawl Data used

This project used two types of data from Common Crawl. For more information on the types of data available for each crawl, see Common Crawl’s Getting Started page.

Common Crawl indices

The indices are gzipped text files. Each line contains a key that enables easy sorting of URLs by host and domain, a timestamp, and a JSON object that contains metadata about the URL. The JSON object includes, among other things, the URL, the mime, the detected mime, and the location of the individual WARC file. That location is given as the path to the compound WARC plus the offset and length of the individual WARC file within the compound WARC. See commoncrawl-fetcher-lite for more details.
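As a rough illustration of the line layout described above, the following Python sketch splits one index line into its three parts and derives an inclusive byte range from the offset and length. The JSON field names used here (url, filename, offset, length) and the sample line are assumptions made for the example; check them against the index files for your crawl.

```python
import json


def parse_index_line(line: str) -> dict:
    """Split one index line into its sort key, timestamp and JSON metadata."""
    # The JSON object may itself contain spaces, so split at most twice.
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    meta = json.loads(payload)
    offset = int(meta["offset"])   # assumed field name
    length = int(meta["length"])   # assumed field name
    return {
        "key": key,
        "timestamp": timestamp,
        "url": meta["url"],
        "warc_file": meta["filename"],   # assumed field name
        "start": offset,
        "end": offset + length - 1,      # inclusive end offset
    }


# Invented sample line, for illustration only.
sample = (
    'org,example)/report.pdf 20210731020000 '
    '{"url": "https://example.org/report.pdf", "mime": "application/pdf", '
    '"mime-detected": "application/pdf", '
    '"filename": "crawl-data/example.warc.gz", '
    '"offset": "1000", "length": "250"}'
)
record = parse_index_line(sample)
```

The inclusive end offset is start + length - 1, which is the form an HTTP range request expects.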

Common Crawl WARCs

Common Crawl concatenates gzipped WARCs into very large WARC files. To fetch an individual file’s original WARC, users need to know the source WARC file, the offset for the individual file and the length. See below for a worked example.

File Types

Our team processed the indices for this crawl and extracted all files where the HTTP Content-Type header contained the letters pdf or where Common Crawl’s automatic file detection detected a PDF. We acknowledge that this choice means the corpus includes some files that are not actually PDFs.
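The selection rule above could be sketched roughly as follows in Python. This is not the team's actual code. The field names mime and mime-detected, the case-insensitive comparison, and the exact detected value application/pdf are all assumptions made for the example.

```python
def is_pdf_candidate(meta: dict) -> bool:
    """Return True if an index record looks like a PDF under the stated rule."""
    # Assumed field names; missing values are treated as empty strings.
    header_mime = (meta.get("mime") or "").lower()
    detected_mime = (meta.get("mime-detected") or "").lower()
    # Rule 1: the Content-Type header contains the letters "pdf".
    # Rule 2: automatic file detection reported a PDF.
    return "pdf" in header_mime or detected_mime == "application/pdf"
```

Because the first rule is a substring match, a record whose header claims a PDF type but whose body is something else (an HTML error page, for example) is still selected, which is how non-PDF files end up in the corpus.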

Common Crawl or Refetched

In the indices for a crawl, Common Crawl has a flag for whether or not the file was truncated. We extracted roughly 6 million files directly from Common Crawl. We then refetched from the original URLs nearly 2 million files that Common Crawl had identified as truncated.
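A minimal Python sketch of that split, assuming each index record is a dict and that the truncation flag appears under a key named truncated that is absent or empty for complete files. Both the key name and that convention are assumptions for the example.

```python
def split_by_truncation(records):
    """Partition index records into (intact, truncated) lists."""
    intact, truncated = [], []
    for meta in records:
        # Assumed convention: a non-empty "truncated" value marks a truncated file.
        if meta.get("truncated"):
            truncated.append(meta)   # candidates for refetching from the original URL
        else:
            intact.append(meta)      # can be extracted directly from Common Crawl
    return intact, truncated
```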

Filenaming

We sorted the files by their SHA-256 digest and then numbered them in that order from 0 (0000000.pdf) to roughly 8 million (7932877.pdf). We added a .pdf file extension to every file.
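The numbering scheme can be illustrated with a short Python sketch. It assumes the digests are compared as lowercase hex strings and that each digest appears only once; the function name is hypothetical.

```python
import hashlib


def assign_names(blobs):
    """Map each file's SHA-256 hex digest to a zero-padded, seven-digit file name."""
    # Sort by hex digest, then number the files in that order starting at 0.
    digests = sorted(hashlib.sha256(blob).hexdigest() for blob in blobs)
    return {digest: f"{index:07d}.pdf" for index, digest in enumerate(digests)}


names = assign_names([b"first file", b"second file", b"third file"])
```

Since the names follow digest order, a file's number gives a rough idea of where its SHA-256 falls in the sorted list, as the errata table below shows.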

Errata

We are aware that the following PDFs are missing from the corpus. These omissions were caused by sporadic S3 write exceptions during the fetching and refetching.

File name	sha256
177150.pdf	05ba53532b7bfc15901bc1bd3371421be758bb08cc2070528a49be4c0b77c6c7
594742.pdf	1334239e569fad2a30d11f6f90d5f75645ded13870cd9b6118b4930d297a23e9
706328.pdf	16cd8100c6a8710d5c404ee11bfc285efee5693c6ceaa42fce2b466051b2c40a
1260258.pdf	28a410c2b3a767d618b44980be1a68335fd436e70165211d03421fcd198e4de7
1544119.pdf	31ca2adee5ea5ac522bf02db2a9a70bdc0e220ccc242dce9b22254e9a3f7c8fa
1591732.pdf	3354af25e39f6ccfabb7833f14958512537dd019e9d4dddeb912fb5b5799158b
1640603.pdf	34eb229ecac8ddecf1632a06762a1998477c07d56249db84edfd157245b6022c
1890087.pdf	3cf45e3dc0fdf429ac894d77ea85460db744dd93c8704102b914974e7b963630
1920911.pdf	3df2586c61b34ad857b4f13eebf1bf2fd8f1a9af71c582c26640278166ba1f7f
1992331.pdf	403f27afa6c84a5fbc512361d9929ef49ae00d399f1b1f876c26a900d056a846
2519839.pdf	51467cf4516df4919c3b195ad67c10a668d339a705c4644ce60fd69f39f6730e
2712444.pdf	577c5f029ff827362b5a71d14f1e4a015bea3eb53960e250ffa1dde2f7ae0050
2765539.pdf	59343aae861d86d9d360b4ccf0183f33a77e49b67696ee1f900821e7dad1f04e
3179469.pdf	669693d161926d705d63ac8fed895857549b4b7e5d82c2ead56a07c367616fb5
4170238.pdf	86931ce5974bff673eb48aa4159b6c215efea4ce636f8e486e9fd54c14e33e9f
4414331.pdf	8e77a888f6ac85d24ac63e55810c2b2646ba18f540037ac748b50007f7c1c8c8
4512373.pdf	91a3d6390adceb54e0ff993f8cfd58250f1bbabfd5ef061a7659ed019897d179
4977579.pdf	a09de5d289dd95d4b4b71d13e196e05db5ab5d228c65afcd74e5900a40a11b09
5198714.pdf	a7c81076098d7e179d13ab60a8da6c8897f71315060b73b959667e0f8ff385b9
5236677.pdf	a9031fc3fbaecf9abcb906e630fdeb90e71e1f9e3d78959ff5101e0fdaa7de65
5447694.pdf	afd19ad6ca780aa7c90756e97aa20fb11bf4781cfc0ee00e5bf23f66f940f51a
6318895.pdf	cbeb29136aaa7b934c2b8616dafa4b8b9213235ed1be9c818c3858c990914275
6817632.pdf	dc0840305e174825fa1471dc2ab463bdffece4ec78b496b5e6a65245f4df4cc1
6940914.pdf	e004f3c7cf38f24ed278b9d3c30c5269f625ed66c623bd6f46ecb3aed9dac3d4
7241425.pdf	e9b4ec5975d197ffc9d199a188d68cc75cb323ecca545df0668c567bc04a769a
7279847.pdf	eaf2d8ba2606262e861d5e8fe0b26b9c456d1fd3290c17d7c115dc14e02a73ca
7407159.pdf	ef107d1cd9224d3582a1364b012f1585a6192ef1fa3267ab18c078777083091f
7635694.pdf	f670fc79401a83b67f2695666803fb8e2ef2fe05a20c2880ea9f0b7465431523
7889525.pdf	fe9b31aa4fcf115ae893ffb2937558a11ee7c80ed9dd1908c3a9451ae8d3c140

Extras

1. How to extract an individual WARC from Common Crawl

First, users need the cc_warc_file, cc_warc_start and cc_warc_end values from the provenance table. We’ll use curl and gunzip. Let’s say we want to pull 0000000.pdf, which comes from crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz. The record starts at offset 3,724,499 and ends at offset 3,742,341 (inclusive).

Prepend https://data.commoncrawl.org/ to the cc_warc_file to get the URL.
The HTTP range will be: 3724499-3742341
Fetch the gzipped WARC file: curl -r 3724499-3742341 https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz -o 0000000.warc.gz
Decompress it, which produces 0000000.warc: gunzip 0000000.warc.gz
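For readers who prefer a script to curl and gunzip, the same steps can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the project's tooling. The function names are hypothetical, and it assumes the requested byte range is a complete gzip stream, as it is in the curl example above.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def build_request(cc_warc_file: str, start: int, end: int) -> urllib.request.Request:
    """Build an HTTP range request for one record; start and end are inclusive."""
    request = urllib.request.Request(BASE_URL + cc_warc_file)
    request.add_header("Range", f"bytes={start}-{end}")
    return request


def fetch_warc_record(cc_warc_file: str, start: int, end: int) -> bytes:
    """Download the byte range and return the decompressed WARC record."""
    with urllib.request.urlopen(build_request(cc_warc_file, start, end)) as response:
        return gzip.decompress(response.read())


# Example call (requires network access):
# warc = fetch_warc_record(
#     "crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/"
#     "CC-MAIN-20210731011529-20210731041529-00143.warc.gz",
#     3724499,
#     3742341,
# )
```

The returned bytes are the WARC record, that is, the WARC headers and the HTTP response headers followed by the payload, so the PDF itself still has to be separated from those headers.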