Types of Common Crawl Data used
This project used two types of data from Common Crawl. For more information on the types of data available for each crawl, see Common Crawl’s Getting Started page.
Common Crawl indices
The indices are gzipped text files, where each line contains a key to enable easy sorting of URLs by host and domain, a timestamp and a JSON object that contains metadata about each URL. Information in the JSON object includes, among other things: URL, mime, detected mime, and the location of the individual WARC file as specified by the path to the compound WARC and the offset and length of the individual WARC file within the compound WARC. See commoncrawl-fetcher-lite for more details.
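As a rough illustration of the line layout described above, the following sketch splits one index line into its key, timestamp, and JSON metadata. The JSON field names (`url`, `mime`, `mime-detected`, `filename`, `offset`, `length`) and the sample line are assumptions made for this example; check them against the actual index files before relying on them.

```python
import gzip
import json


def parse_index_line(line: str) -> dict:
    """Split one index line into sort key, timestamp, and JSON metadata."""
    # The JSON object can contain spaces, so split at most twice.
    key, timestamp, raw_json = line.rstrip("\n").split(" ", 2)
    record = json.loads(raw_json)
    record["_key"] = key
    record["_timestamp"] = timestamp
    return record


def iter_index_records(path: str):
    """Yield parsed records from a gzipped index file, one per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_index_line(line)


# Made-up line for illustration only; field names are assumed, not confirmed.
sample = (
    'com,example)/report.pdf 20210731012345 '
    '{"url": "https://example.com/report.pdf", "mime": "application/pdf", '
    '"mime-detected": "application/pdf", '
    '"filename": "crawl-data/EXAMPLE/example.warc.gz", '
    '"offset": "1000", "length": "250"}'
)
record = parse_index_line(sample)
print(record["url"], record["filename"], record["offset"], record["length"])
```

The path, offset, and length recovered this way are the three values needed to locate an individual WARC record inside a compound WARC.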
Common Crawl WARCs
Common Crawl concatenates gzipped WARCs into very large WARC files. To fetch an individual file’s original WARC, users need to know the source WARC file, the offset for the individual file and the length. See below for a worked example.
File Types
Our team processed the indices for this crawl and extracted all files where an HTTP `Content-Type` header contained the letters `pdf` or where Common Crawl’s automatic file detection detected a PDF. We acknowledge that this choice will result in files that are not actually PDFs.
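The selection rule can be sketched as a small predicate over an index record. This is an illustration only, not the project's actual code: the field names `mime` and `mime-detected`, the case-insensitive comparison, and the substring test on the detected type are all assumptions.

```python
def looks_like_pdf(record: dict) -> bool:
    """True when the reported or the detected MIME type mentions 'pdf'."""
    # Assumed field names: "mime" (from the HTTP Content-Type header) and
    # "mime-detected" (from Common Crawl's automatic detection).
    reported = (record.get("mime") or "").lower()
    detected = (record.get("mime-detected") or "").lower()
    return "pdf" in reported or "pdf" in detected


# A server can claim PDF for something that is really HTML, which is
# one way non-PDF files end up in the selection.
mislabeled = {"mime": "application/pdf", "mime-detected": "text/html"}
print(looks_like_pdf(mislabeled))
```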
Common Crawl or Refetched
In the indices for a crawl, Common Crawl has a flag for whether or not the file was truncated. We extracted roughly 6 million files directly from Common Crawl. We then refetched from the original URLs nearly 2 million files that Common Crawl had identified as truncated.
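A minimal sketch of that split, assuming the flag appears in each index record under a field named `truncated` that is absent or empty for complete files (the field name and its values are assumptions for illustration):

```python
def split_by_truncation(records):
    """Separate records into (intact, truncated) lists using the index flag."""
    intact, truncated = [], []
    for record in records:
        # Assumed convention: a non-empty "truncated" value marks a cut-off file.
        if record.get("truncated"):
            truncated.append(record)
        else:
            intact.append(record)
    return intact, truncated


# Hypothetical records for illustration only.
records = [
    {"url": "https://example.com/a.pdf"},
    {"url": "https://example.com/b.pdf", "truncated": "length"},
]
intact, truncated = split_by_truncation(records)
print(len(intact), "to extract from Common Crawl;", len(truncated), "to refetch")
```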
Filenaming
We sorted the files by `sha-256` and then numbered them from 0 (`0000000.pdf`) to roughly 8 million (`7932877.pdf`). We added a `.pdf` file extension to every file.
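The naming scheme can be sketched as follows. This assumes the digests are distinct lowercase hex strings compared lexicographically and that names are zero-padded to seven digits, as the two example names suggest; the helper name is hypothetical.

```python
import hashlib


def assign_file_names(digests):
    """Map each SHA-256 hex digest to a sequential, zero-padded file name."""
    # Assumes every digest is distinct; sorting is plain string order.
    ordered = sorted(digests)
    return {digest: f"{index:07d}.pdf" for index, digest in enumerate(ordered)}


# Toy payloads standing in for real file contents.
payloads = [b"first file", b"second file", b"third file"]
digests = [hashlib.sha256(data).hexdigest() for data in payloads]
names = assign_file_names(digests)
for digest in sorted(names):
    print(names[digest], digest)
```

Because the numbering follows digest order, a file's number gives a rough hint of where its SHA-256 falls in the sorted range.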
Errata
We are aware that the following PDFs are missing from the corpus. These omissions were caused by sporadic S3 write exceptions during fetching and refetching.
File name | sha256 |
---|---|
177150.pdf | 05ba53532b7bfc15901bc1bd3371421be758bb08cc2070528a49be4c0b77c6c7 |
594742.pdf | 1334239e569fad2a30d11f6f90d5f75645ded13870cd9b6118b4930d297a23e9 |
706328.pdf | 16cd8100c6a8710d5c404ee11bfc285efee5693c6ceaa42fce2b466051b2c40a |
1260258.pdf | 28a410c2b3a767d618b44980be1a68335fd436e70165211d03421fcd198e4de7 |
1544119.pdf | 31ca2adee5ea5ac522bf02db2a9a70bdc0e220ccc242dce9b22254e9a3f7c8fa |
1591732.pdf | 3354af25e39f6ccfabb7833f14958512537dd019e9d4dddeb912fb5b5799158b |
1640603.pdf | 34eb229ecac8ddecf1632a06762a1998477c07d56249db84edfd157245b6022c |
1890087.pdf | 3cf45e3dc0fdf429ac894d77ea85460db744dd93c8704102b914974e7b963630 |
1920911.pdf | 3df2586c61b34ad857b4f13eebf1bf2fd8f1a9af71c582c26640278166ba1f7f |
1992331.pdf | 403f27afa6c84a5fbc512361d9929ef49ae00d399f1b1f876c26a900d056a846 |
2519839.pdf | 51467cf4516df4919c3b195ad67c10a668d339a705c4644ce60fd69f39f6730e |
2712444.pdf | 577c5f029ff827362b5a71d14f1e4a015bea3eb53960e250ffa1dde2f7ae0050 |
2765539.pdf | 59343aae861d86d9d360b4ccf0183f33a77e49b67696ee1f900821e7dad1f04e |
3179469.pdf | 669693d161926d705d63ac8fed895857549b4b7e5d82c2ead56a07c367616fb5 |
4170238.pdf | 86931ce5974bff673eb48aa4159b6c215efea4ce636f8e486e9fd54c14e33e9f |
4414331.pdf | 8e77a888f6ac85d24ac63e55810c2b2646ba18f540037ac748b50007f7c1c8c8 |
4512373.pdf | 91a3d6390adceb54e0ff993f8cfd58250f1bbabfd5ef061a7659ed019897d179 |
4977579.pdf | a09de5d289dd95d4b4b71d13e196e05db5ab5d228c65afcd74e5900a40a11b09 |
5198714.pdf | a7c81076098d7e179d13ab60a8da6c8897f71315060b73b959667e0f8ff385b9 |
5236677.pdf | a9031fc3fbaecf9abcb906e630fdeb90e71e1f9e3d78959ff5101e0fdaa7de65 |
5447694.pdf | afd19ad6ca780aa7c90756e97aa20fb11bf4781cfc0ee00e5bf23f66f940f51a |
6318895.pdf | cbeb29136aaa7b934c2b8616dafa4b8b9213235ed1be9c818c3858c990914275 |
6817632.pdf | dc0840305e174825fa1471dc2ab463bdffece4ec78b496b5e6a65245f4df4cc1 |
6940914.pdf | e004f3c7cf38f24ed278b9d3c30c5269f625ed66c623bd6f46ecb3aed9dac3d4 |
7241425.pdf | e9b4ec5975d197ffc9d199a188d68cc75cb323ecca545df0668c567bc04a769a |
7279847.pdf | eaf2d8ba2606262e861d5e8fe0b26b9c456d1fd3290c17d7c115dc14e02a73ca |
7407159.pdf | ef107d1cd9224d3582a1364b012f1585a6192ef1fa3267ab18c078777083091f |
7635694.pdf | f670fc79401a83b67f2695666803fb8e2ef2fe05a20c2880ea9f0b7465431523 |
7889525.pdf | fe9b31aa4fcf115ae893ffb2937558a11ee7c80ed9dd1908c3a9451ae8d3c140 |
We have also been informed that the following 25 files are missing. We are researching the root cause to see if these files can be recovered.
File name | sha256 |
---|---|
5148444.pdf | a6263eb21c0cd9921f8ea19bd0ac29b25f8939b9ec9df6f72c9c283938003418 |
5217141.pdf | a86252e966475bd5841c58623969b44b6addd06d085c101d61ce992d4310e3e4 |
5429506.pdf | af3ad3b44dba45a7254fb2f02b4e761e0079042b122801edba356b521767672b |
5511351.pdf | b1de9844321d3bd425a01f4b323e7bc9235b8f3fe491069dcc6bc84ec065ea0a |
5588341.pdf | b45a2d388e608840beebfde5b3d54b2b8473255612500a25186c729cd554b2c4 |
5627292.pdf | b59b6f49184f8adc498ac8df859e9561e58b5bcfb2c636572578105a7932ba9a |
5729318.pdf | b8e4769302218e56b79a0be24c4c84feaaf4a7ca28e648499a5209e64a78d48c |
5818642.pdf | bbc5f99b12d4c63bc18a9147b886edfa39f3ef3010af4e250b2fce8cfe175050 |
6241601.pdf | c96945f0dfcbea92d8e7b73441b44713d4e8126ff2b6186841623e410145a534 |
6305828.pdf | cb7f2feff6f316bf1b8b02be0f235d2f1bd76c84eda311a2e10f4b779d337440 |
6564043.pdf | d3d4f3d648263881122fc8c4c16a56b1c6a8854c980310e272e2a4b1f1980a74 |
6567049.pdf | d3ee8804b8fb70fa49e4a2af79c3c1103ba9616c191bfbe78aaa9997c1cd4ebf |
6627962.pdf | d5e4eae9206ca7549268816f995d7862eb8435a2f1d744e7a490b249accf92cb |
6645396.pdf | d676433bb59b5734ea09bf0984c4e521c07ada0be65f7f57fbc551e344ebf9da |
6768494.pdf | da70d4360322380a94d7a9ea258a3b65805b7d6c94306357143ca131854f3ebf |
6918017.pdf | df4707852d2777b3bb9e6c14aa4142a491edad488814e37886e7f86967ed2a7f |
6976174.pdf | e127fef3a3db31fb5cddd8e19622de82cac969e1f14151a0584d9f15b52e2fb9 |
7156801.pdf | e6ff591b3718dd0ea5a530533f2ff9163e98673392b8d3f57da321c9c44e06b4 |
7503043.pdf | f229c1f9ef8f6be67d660cac3e452cf7a7870b95369fc46ee6f8fe0456c59698 |
7569116.pdf | f44c12f77ef8dc2eb0d3894f7025cc93e5ce4b34d58bdbcbb9fd5496b9629fe9 |
7622768.pdf | f606ff03ed252dc56292734a8b4bc1373ce6e29dc04f2bdcb212d4aaa5f7c6f5 |
7744455.pdf | f9ef08b4f915862afb0ea019b60020fce58da555524f15612a55e00fa05e19e8 |
7767671.pdf | faad69c320245cf76b1754d36ee63a9b01b970cbf6040158d0577a1f8ef14752 |
7850363.pdf | fd58e8fc878c61842da2c988638700e6be7443e3b1649c97ba39758c1d035d88 |
7917938.pdf | ff83b4c404c3681b31190675383a89cb847cc6b72f7c6d34847285abec0eeb77 |
Extras
1. How to extract an individual WARC from Common Crawl
First, users need the `cc_warc_file`, the `cc_warc_start` and the `cc_warc_end` from the provenance table. We’ll use `curl` and `gunzip`. Let’s say we want to pull `0000000.pdf`, which comes from `crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz`, starting at offset `3,724,499` and ending at offset `3,742,341` (inclusive).
- Prepend `https://data.commoncrawl.org/` to the `cc_warc_file` to get the URL.
- The HTTP range will be: `3724499-3742341`
- Fetch the gzipped WARC file: `curl -r 3724499-3742341 https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/CC-MAIN-20210731011529-20210731041529-00143.warc.gz -o 0000000.warc.gz`
- Decompress it: `gunzip 0000000.warc.gz`
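
For readers who prefer to script this step, the same fetch can be sketched in Python with only the standard library. The function names are hypothetical, and the sketch assumes the server honors HTTP range requests and that the requested byte range is a single, complete gzip member, as the `curl` and `gunzip` steps imply.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def build_range_request(cc_warc_file: str, start: int, end: int) -> urllib.request.Request:
    """Build a GET request for bytes start..end (inclusive) of a compound WARC."""
    request = urllib.request.Request(BASE_URL + cc_warc_file)
    request.add_header("Range", f"bytes={start}-{end}")
    return request


def fetch_warc_record(cc_warc_file: str, start: int, end: int) -> bytes:
    """Download one gzipped WARC record and return it decompressed."""
    request = build_range_request(cc_warc_file, start, end)
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    # Assumes the byte range holds exactly one complete gzip member.
    return gzip.decompress(compressed)


WARC_FILE = (
    "crawl-data/CC-MAIN-2021-31/segments/1627046154042.23/warc/"
    "CC-MAIN-20210731011529-20210731041529-00143.warc.gz"
)
request = build_range_request(WARC_FILE, 3724499, 3742341)
print(request.full_url)
print(request.get_header("Range"))

# To perform the download (requires network access):
# record = fetch_warc_record(WARC_FILE, 3724499, 3742341)
# with open("0000000.warc", "wb") as out:
#     out.write(record)
```

The decompressed record still contains the WARC and HTTP headers ahead of the PDF bytes, just as the output of `gunzip` does.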