Constructing the CC-MAIN-2021-31-PDF-UNTRUNCATED corpus

Types of Common Crawl Data used

This project used two types of data from Common Crawl. For more information on the types of data available for each crawl, see Common Crawl’s Getting Started page.

Common Crawl indices

The indices are gzipped text files. Each line contains a key that enables easy sorting of URLs by host and domain, a timestamp, and a JSON object with metadata about the URL. The JSON object includes, among other things, the URL, the mime type, the detected mime type, and the location of the individual WARC file, given as the path to the compound WARC plus the offset and length of the individual WARC file within it. See commoncrawl-fetcher-lite for more details.
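To make the line layout concrete, here is a minimal sketch of splitting one index line into its three parts. The sample line is made up for illustration, and the JSON field names (mime-detected, filename, offset, length) are assumptions about the index layout rather than something this page specifies. A real index file would first be opened with gzip.open in text mode.

```python
import json


def parse_index_line(line: str) -> dict:
    """Split one index line into its sort key, timestamp, and JSON metadata."""
    # The JSON object contains spaces, so split on the first two spaces only.
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return {"key": key, "timestamp": timestamp, "meta": json.loads(payload)}


# Illustrative line only: the values are invented, not taken from the crawl.
sample = (
    'org,example)/report.pdf 20210731011529 '
    '{"url": "https://example.org/report.pdf", "mime": "application/pdf", '
    '"mime-detected": "application/pdf", '
    '"filename": "crawl-data/example.warc.gz", "offset": "1000", "length": "250"}'
)

parsed = parse_index_line(sample)
print(parsed["key"], parsed["timestamp"], parsed["meta"]["filename"])
```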

Common Crawl WARCs

Common Crawl concatenates gzipped WARCs into very large WARC files. To fetch an individual file’s original WARC, users need to know the source WARC file, the offset for the individual file and the length. See below for a worked example.

File Types

Our team processed the indices for this crawl and extracted all files where an HTTP Content-Type header contained the letters pdf or where Common Crawl’s automatic file detection detected a PDF. We acknowledge that this choice means the corpus includes some files that are not actually PDFs.
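The selection rule can be sketched as a small predicate over the index metadata. This is an illustration, not the team's actual code: the field names mime and mime-detected, the case-insensitive comparison, and the exact detected value application/pdf are all assumptions.

```python
def looks_like_pdf(meta: dict) -> bool:
    """Return True if either the declared or the detected type points to a PDF."""
    # Assumed field names: "mime" (HTTP Content-Type) and "mime-detected".
    content_type = (meta.get("mime") or "").lower()
    detected = (meta.get("mime-detected") or "").lower()
    # Rule from the text: header contains the letters "pdf", OR detection says PDF.
    return "pdf" in content_type or detected == "application/pdf"


# A server that mislabels an HTML page as PDF is still selected, which is
# one way non-PDF files can end up in a corpus built with this rule.
mislabeled = {"mime": "application/pdf", "mime-detected": "text/html"}
print(looks_like_pdf(mislabeled))
```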

Common Crawl or Refetched

In the indices for a crawl, Common Crawl has a flag for whether or not the file was truncated. We extracted roughly 6 million files directly from Common Crawl. We then refetched from the original URLs nearly 2 million files that Common Crawl had identified as truncated.
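As a rough sketch of this split, the records can be partitioned on the truncation flag. The field name truncated, and the convention that it is absent or empty for complete files, are assumptions here; the page only says that a flag exists.

```python
def split_by_truncation(records):
    """Partition index metadata into (take from Common Crawl, refetch from URL)."""
    direct, refetch = [], []
    for meta in records:
        # Assumed convention: a non-empty "truncated" value marks a cut-off file.
        if meta.get("truncated"):
            refetch.append(meta)
        else:
            direct.append(meta)
    return direct, refetch


# Hypothetical records for illustration only.
direct, refetch = split_by_truncation([
    {"url": "https://example.org/a.pdf"},
    {"url": "https://example.org/b.pdf", "truncated": "length"},
])
print(len(direct), len(refetch))
```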

Filenaming

We sorted the files by sha-256 and then numbered them from 0 (0000000.pdf) to roughly 8 million (7932877.pdf). We added a .pdf file extension to every file.
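The naming scheme can be illustrated with a short sketch, assuming seven-digit zero padding (as in 0000000.pdf) and one name per distinct digest. The function name is hypothetical, and collapsing duplicate digests is an assumption rather than a documented step.

```python
import hashlib


def number_by_sha256(digests):
    """Map each distinct hex digest to a zero-padded, sequential .pdf name."""
    # Sorting hex strings of equal length orders them by numeric hash value.
    ordered = sorted(set(digests))
    return {digest: f"{index:07d}.pdf" for index, digest in enumerate(ordered)}


# Toy inputs standing in for file contents.
toy_digests = [hashlib.sha256(data).hexdigest() for data in (b"one", b"two", b"three")]
names = number_by_sha256(toy_digests)
for digest in sorted(names):
    print(names[digest], digest)
```

Because the numbering follows the sort order, a file's name gives a rough indication of where its digest falls in the hash space.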

Errata

We are aware that the following PDFs are missing from the corpus. These were caused by sporadic S3 write exceptions during the fetching and refetching.

File name    sha256
177150.pdf   05ba53532b7bfc15901bc1bd3371421be758bb08cc2070528a49be4c0b77c6c7
594742.pdf   1334239e569fad2a30d11f6f90d5f75645ded13870cd9b6118b4930d297a23e9
706328.pdf   16cd8100c6a8710d5c404ee11bfc285efee5693c6ceaa42fce2b466051b2c40a
1260258.pdf  28a410c2b3a767d618b44980be1a68335fd436e70165211d03421fcd198e4de7
1544119.pdf  31ca2adee5ea5ac522bf02db2a9a70bdc0e220ccc242dce9b22254e9a3f7c8fa
1591732.pdf  3354af25e39f6ccfabb7833f14958512537dd019e9d4dddeb912fb5b5799158b
1640603.pdf  34eb229ecac8ddecf1632a06762a1998477c07d56249db84edfd157245b6022c
1890087.pdf  3cf45e3dc0fdf429ac894d77ea85460db744dd93c8704102b914974e7b963630
1920911.pdf  3df2586c61b34ad857b4f13eebf1bf2fd8f1a9af71c582c26640278166ba1f7f
1992331.pdf  403f27afa6c84a5fbc512361d9929ef49ae00d399f1b1f876c26a900d056a846
2519839.pdf  51467cf4516df4919c3b195ad67c10a668d339a705c4644ce60fd69f39f6730e
2712444.pdf  577c5f029ff827362b5a71d14f1e4a015bea3eb53960e250ffa1dde2f7ae0050
2765539.pdf  59343aae861d86d9d360b4ccf0183f33a77e49b67696ee1f900821e7dad1f04e
3179469.pdf  669693d161926d705d63ac8fed895857549b4b7e5d82c2ead56a07c367616fb5
4170238.pdf  86931ce5974bff673eb48aa4159b6c215efea4ce636f8e486e9fd54c14e33e9f
4414331.pdf  8e77a888f6ac85d24ac63e55810c2b2646ba18f540037ac748b50007f7c1c8c8
4512373.pdf  91a3d6390adceb54e0ff993f8cfd58250f1bbabfd5ef061a7659ed019897d179
4977579.pdf  a09de5d289dd95d4b4b71d13e196e05db5ab5d228c65afcd74e5900a40a11b09
5198714.pdf  a7c81076098d7e179d13ab60a8da6c8897f71315060b73b959667e0f8ff385b9
5236677.pdf  a9031fc3fbaecf9abcb906e630fdeb90e71e1f9e3d78959ff5101e0fdaa7de65
5447694.pdf  afd19ad6ca780aa7c90756e97aa20fb11bf4781cfc0ee00e5bf23f66f940f51a
6318895.pdf  cbeb29136aaa7b934c2b8616dafa4b8b9213235ed1be9c818c3858c990914275
6817632.pdf  dc0840305e174825fa1471dc2ab463bdffece4ec78b496b5e6a65245f4df4cc1
6940914.pdf  e004f3c7cf38f24ed278b9d3c30c5269f625ed66c623bd6f46ecb3aed9dac3d4
7241425.pdf  e9b4ec5975d197ffc9d199a188d68cc75cb323ecca545df0668c567bc04a769a
7279847.pdf  eaf2d8ba2606262e861d5e8fe0b26b9c456d1fd3290c17d7c115dc14e02a73ca
7407159.pdf  ef107d1cd9224d3582a1364b012f1585a6192ef1fa3267ab18c078777083091f
7635694.pdf  f670fc79401a83b67f2695666803fb8e2ef2fe05a20c2880ea9f0b7465431523
7889525.pdf  fe9b31aa4fcf115ae893ffb2937558a11ee7c80ed9dd1908c3a9451ae8d3c140

We have also been informed that the following 25 files are missing. We are researching the root cause to see if these files can be recovered.

File name    sha256
5148444.pdf  a6263eb21c0cd9921f8ea19bd0ac29b25f8939b9ec9df6f72c9c283938003418
5217141.pdf  a86252e966475bd5841c58623969b44b6addd06d085c101d61ce992d4310e3e4
5429506.pdf  af3ad3b44dba45a7254fb2f02b4e761e0079042b122801edba356b521767672b
5511351.pdf  b1de9844321d3bd425a01f4b323e7bc9235b8f3fe491069dcc6bc84ec065ea0a
5588341.pdf  b45a2d388e608840beebfde5b3d54b2b8473255612500a25186c729cd554b2c4
5627292.pdf  b59b6f49184f8adc498ac8df859e9561e58b5bcfb2c636572578105a7932ba9a
5729318.pdf  b8e4769302218e56b79a0be24c4c84feaaf4a7ca28e648499a5209e64a78d48c
5818642.pdf  bbc5f99b12d4c63bc18a9147b886edfa39f3ef3010af4e250b2fce8cfe175050
6241601.pdf  c96945f0dfcbea92d8e7b73441b44713d4e8126ff2b6186841623e410145a534
6305828.pdf  cb7f2feff6f316bf1b8b02be0f235d2f1bd76c84eda311a2e10f4b779d337440
6564043.pdf  d3d4f3d648263881122fc8c4c16a56b1c6a8854c980310e272e2a4b1f1980a74
6567049.pdf  d3ee8804b8fb70fa49e4a2af79c3c1103ba9616c191bfbe78aaa9997c1cd4ebf
6627962.pdf  d5e4eae9206ca7549268816f995d7862eb8435a2f1d744e7a490b249accf92cb
6645396.pdf  d676433bb59b5734ea09bf0984c4e521c07ada0be65f7f57fbc551e344ebf9da
6768494.pdf  da70d4360322380a94d7a9ea258a3b65805b7d6c94306357143ca131854f3ebf
6918017.pdf  df4707852d2777b3bb9e6c14aa4142a491edad488814e37886e7f86967ed2a7f
6976174.pdf  e127fef3a3db31fb5cddd8e19622de82cac969e1f14151a0584d9f15b52e2fb9
7156801.pdf  e6ff591b3718dd0ea5a530533f2ff9163e98673392b8d3f57da321c9c44e06b4
7503043.pdf  f229c1f9ef8f6be67d660cac3e452cf7a7870b95369fc46ee6f8fe0456c59698
7569116.pdf  f44c12f77ef8dc2eb0d3894f7025cc93e5ce4b34d58bdbcbb9fd5496b9629fe9
7622768.pdf  f606ff03ed252dc56292734a8b4bc1373ce6e29dc04f2bdcb212d4aaa5f7c6f5
7744455.pdf  f9ef08b4f915862afb0ea019b60020fce58da555524f15612a55e00fa05e19e8
7767671.pdf  faad69c320245cf76b1754d36ee63a9b01b970cbf6040158d0577a1f8ef14752
7850363.pdf  fd58e8fc878c61842da2c988638700e6be7443e3b1649c97ba39758c1d035d88
7917938.pdf  ff83b4c404c3681b31190675383a89cb847cc6b72f7c6d34847285abec0eeb77

Extras

1. How to extract an individual WARC from Common Crawl

First, users need the cc_warc_file, the cc_warc_start and the cc_warc_end from the provenance table. We’ll use curl and gunzip. Let’s say we want to pull 0000000.pdf, which comes from crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz, starting at offset 3,724,499 and ending at offset 3,742,341 (inclusive).

  1. Prepend https://data.commoncrawl.org/ to the cc_warc_file to get the URL.
  2. The HTTP range will be: 3724499-3742341
  3. Fetch the gzipped WARC file: curl -r 3724499-3742341 https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz -o 0000000.warc.gz
  4. Decompress it: gunzip 0000000.warc.gz
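
The same steps can be sketched in Python using only the standard library. This is an illustrative equivalent of the curl and gunzip commands above, not part of the project's tooling. The function names are hypothetical, and the sketch assumes the server honors HTTP Range requests, as the curl example relies on.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def build_range_request(cc_warc_file: str, start: int, end: int) -> urllib.request.Request:
    """Build a GET request for bytes start..end (inclusive) of a compound WARC."""
    # Step 1: prepend the base URL. Step 2: express the range as an HTTP header.
    url = BASE_URL + cc_warc_file
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_warc_record(cc_warc_file: str, start: int, end: int) -> bytes:
    """Fetch one gzipped WARC record and return its decompressed bytes."""
    request = build_range_request(cc_warc_file, start, end)
    # Step 3: download only the requested slice (this performs a network call).
    with urllib.request.urlopen(request, timeout=60) as response:
        compressed = response.read()
    # Step 4: the slice is a complete gzip member, so it decompresses on its own.
    return gzip.decompress(compressed)


# Example call for 0000000.pdf, left commented out because it needs network access:
# record = fetch_warc_record(
#     "crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/"
#     "CC-MAIN-20210731011529-20210731041529-00143.warc.gz",
#     3724499,
#     3742341,
# )
```

The inclusive range 3724499-3742341 covers 17,843 bytes. As with gunzip, the decompressed result is the full WARC record, that is, the WARC headers and HTTP response headers followed by the payload, not the bare PDF.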