S3 Information

The Digital Corpora project gets free hosting for the corpus as part of the AWS Open Data Sponsorship Program, for which we are grateful! We could not make this resource available without Amazon’s help.

Accessing the Corpus!

All of our site data is stored in the Amazon S3 bucket s3://digitalcorpora/.

You can download directly from that bucket. We recommend working with the bucket from inside Amazon’s cloud, which gives you the fastest access to the data.

The Fast JavaScript Gateway

You can browse the S3 bucket directly using our JavaScript-based browser here: [S3 Browser]. It is fast, but it will not work with wget -r to download many files at once.

The Annotated S3 Gateway

You can also browse using our server-based S3 gateway: [S3 Gateway]. It’s written in Python and hosted on DreamHost, the same ISP that hosts this WordPress site. It provides additional annotations, such as cryptographic hash codes and viewing of the README files. You can find the source code here. If there is a problem with the downloads site, you can try the development site or the backup site.

The Unix command line

You can also access this resource from the AWS command line interface with the command:

$ aws s3 ls s3://digitalcorpora/corpora/
                           PRE bin/
                           PRE drives/
                           PRE drives_bulk_extractor/
                           PRE drives_dfxml/
                           PRE files/
                           PRE hashes/
                           PRE mobile/
                           PRE packets/
                           PRE ram/
                           PRE scenarios/
                           PRE sql/
2020-11-21 10:56:19         43 README.txt
2020-11-21 10:56:20    1783404 digitalcorpora.org-hashdeep-2020-04-01.csv
2020-11-21 10:56:19    1787101 digitalcorpora.org-hashdeep-2020-05-01.csv
2020-11-21 10:56:19    1794086 digitalcorpora.org-hashdeep-2020-06-01.csv
2020-11-21 10:56:19    1794914 digitalcorpora.org-hashdeep-2020-07-01.csv
2020-11-21 10:56:20    1796103 digitalcorpora.org-hashdeep-2020-08-01.csv
2020-11-21 10:56:20    1796275 digitalcorpora.org-hashdeep-2020-09-01.csv
2020-11-21 10:56:20    1796447 digitalcorpora.org-hashdeep-2020-10-01.csv
2020-11-21 10:56:20    1796619 digitalcorpora.org-hashdeep-2020-11-01.csv
$
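
The listing above can be extended to downloads. The following is a minimal sketch, not an official tool: the helper names corpus_uri, fetch_one, and mirror_prefix are hypothetical, and the --no-sign-request flag assumes that the bucket allows anonymous reads (drop it to use your own AWS credentials instead).

```shell
#!/bin/sh
# Bucket name is taken from this page. Anonymous access via
# --no-sign-request is an ASSUMPTION about the bucket's policy.
BUCKET="s3://digitalcorpora"

# Build the full S3 URI for a key, tolerating a leading slash.
corpus_uri() {
    printf '%s/%s\n' "$BUCKET" "${1#/}"
}

# Download a single object into the current directory.
fetch_one() {
    aws s3 cp --no-sign-request "$(corpus_uri "$1")" .
}

# Mirror everything under a prefix into a local directory.
# "aws s3 sync" copies only new or changed files on repeat runs.
mirror_prefix() {
    aws s3 sync --no-sign-request "$(corpus_uri "$1")" "$2"
}
```

For example, fetch_one corpora/README.txt would copy the README shown in the listing, and mirror_prefix corpora/hashes/ ./hashes would mirror the hashes/ prefix. Some prefixes hold very large disk images, so check sizes with aws s3 ls before mirroring.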

Download Statistics

You can find download statistics at https://stats.digitalcorpora.org/reports.

Tech Details

https://downloads.digitalcorpora.org/ is an application running on a DreamHost virtual domain. A Python program makes the contents of the Amazon S3 bucket s3://digitalcorpora/ look like a Unix directory, which is then served through the Apache web server’s directory listing facility.

The s3://digitalcorpora/ bucket is hosted in the AWS region US West (Oregon), us-west-2.

The fastest way to access the corpus is to create an EC2 VM in us-west-2 and read the bucket directly from that VM.
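
If the AWS CLI is not installed on the VM, objects can in principle also be fetched over HTTPS from the regional S3 endpoint. The sketch below builds a standard virtual-hosted-style S3 URL from the bucket name and region given on this page. The helper names https_url and fetch_https are hypothetical, and the sketch assumes the bucket permits anonymous HTTPS reads.

```shell
#!/bin/sh
# Bucket name and region come from this page.
BUCKET_NAME="digitalcorpora"
REGION="us-west-2"

# Build a virtual-hosted-style S3 URL for a key:
#   https://<bucket>.s3.<region>.amazonaws.com/<key>
https_url() {
    printf 'https://%s.s3.%s.amazonaws.com/%s\n' \
        "$BUCKET_NAME" "$REGION" "${1#/}"
}

# Download one object with curl, keeping the remote file name.
# ASSUMPTION: the bucket allows unauthenticated GET requests.
fetch_https() {
    curl -fSLO "$(https_url "$1")"
}
```

For example, fetch_https corpora/README.txt would request the README from the us-west-2 endpoint. This retrieves one object per call; it is not a substitute for a recursive download.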