New Scenario: 2018 Lone Wolf

We are pleased to announce a new scenario in the digitalcorpora family!

Released today is the 2018 Lone Wolf Scenario, created by GMU student Thomas Moore. The scenario consists of more than 32GB (compressed) of data that was seized from a fictional individual who was planning a mass shooting.

The 2018 Lone Wolf Scenario is based on a (fictional) unstable individual who is planning a mass shooting. The individual is interrupted when a family member calls the police and his apartment is raided. The task for the investigators is to determine if anyone else was involved.

This scenario contains a disk image and memory dump from a laptop. It’s an image of a real, physical machine that was actually used, so it’s quite big. Also included in the scenario are the results of modern commercial digital forensics tools applied to the dataset, so that students who don’t have access to these tools can still see their results. There is a teacher’s guide that includes a report on all of the planted evidence.

The 2018 Lone Wolf Scenario was created by Thomas J. Moore, a student at George Mason University.

Please remember: this is a fictional scenario about fictional people!

Unlike the other scenarios on our website, this scenario also includes output of commercial forensic tools for student use. The idea is that there is nothing especially creative about running evidence through a tool, and a lot of students do not have access to state-of-the-art commercial tools, so we have run the tools for your students!

A teacher’s guide is available for this scenario.

You can find more information about the 2018 Lone Wolf scenario here: https://digitalcorpora.org/corpora/scenarios/2018-lone-wolf-scenario

Announcing New File Type Sample Files

UT San Antonio has kindly provided digitalcorpora with open source, publicly releasable samples of 32 file types. These are the samples that were used by Dr. Nicole Beebe to develop the Sceadan File Type Classifier.

Included file types are ASP, AVI, B64, B85, BZ2, CSS, DLL, ELF, EXE, EXT3, FAT, FLV, JAR, JB2, JS, M4A, MOV, MP3, MP4, NTFS, PST, RPM, RTF, Random, SWF, TXT, Tbird, URL, WAV, WMA, XLSX, ZIP. Each file type sample can be downloaded from the website:
* https://downloads.digitalcorpora.org/corpora/files/filetypes1/

Also included is a _README directory that includes a list of every file downloaded and a copyright statement for the files that are covered under copyright. You can access that directory at:
* https://downloads.digitalcorpora.org/corpora/files/filetypes1/_README/

This “FILETYPES1” corpus supplements the files in the GOVDOCS1 corpus.

Please let us know if you use these by including this citation in your paper:

“FILETYPES1 File type samples,” Beebe, Nicole, University of Texas, San Antonio, hosted at https://downloads.digitalcorpora.org/corpora/files/filetypes1/. 2014

35GB of JPEGs ready for download

We have created a tar file and a ZIP file containing 109,223 JPEG files from the govdocs1m corpus. You can download them from:

http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/files.jpeg.tar   [37.6 GB]

Browse all by type: http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/

Please note that the ZIP file is necessarily a ZIP-64 file and will not decompress with the ZIP implementation built into macOS or Windows.
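If your built-in unzip tool refuses the archive, one option is Python's standard zipfile module, which can read ZIP-64 archives. The following is a minimal sketch, not a tested recipe for this particular download: the function name extract_zip64 and both paths are placeholders chosen for illustration.

```python
import zipfile
from pathlib import Path


def extract_zip64(archive_path, dest_dir):
    """Extract every member of a (possibly ZIP-64) archive into dest_dir.

    Returns the number of members in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # zipfile reads ZIP-64 archives transparently; no special flag is needed.
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
        zf.extractall(dest)
    return len(names)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python extract.py <archive.zip> <output-directory>
    if len(sys.argv) == 3:
        print(extract_zip64(sys.argv[1], sys.argv[2]), "members extracted")
```

Extraction needs enough free disk space for the uncompressed files in addition to the archive itself.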

Test disk image of emails available

I have created a new disk image called nps-2010-emails that can be used for testing programs that find email addresses or perform string searches.

The disk image consists of 30 different email addresses, each one stored in a different document with a different coding scheme.

Below is a list of the email addresses and their codings:

email address                             Application (Encoding)

plain_text@textedit.com                   Apple TextEdit  (UTF-8)
plain_text_pdf@textedit.com               Apple TextEdit print-to-PDF (/FlateDecode)
rtf_text@textedit.com                     Apple TextEdit (RTF)
rtf_text_pdf@textedit.com                 Apple TextEdit print-to-PDF (/FlateDecode)
plain_utf16@textedit.com                  Apple TextEdit (UTF-16)
plain_utf16_pdf@textedit.com              Apple TextEdit print-to-PDF (/FlateDecode)

pages@iwork09.com                         Apple Pages '09
pages_comment@iwork09.com                 Apple Pages '09 (comment)
keynote@iwork09.com                       Apple Keynote '09
keynote_comment@iwork09.com               Apple Keynote '09 (comment)
numbers@iwork09.com                       Apple Numbers '09
numbers_comment@iwork09.com               Apple Numbers '09 (comment)

user_doc@microsoftword.com                Microsoft Word 2008 (Mac) (.doc file)
user_doc_pdf@microsoftword.com            Microsoft Word 2008 (Mac) print-to-PDF
user_docx@microsoftword.com               Microsoft Word 2008 (Mac) (.docx file)
user_docx_pdf@microsoftword.com           Microsoft Word 2008 (Mac) print-to-PDF (.docx file)
xls_cell@microsoft_excel.com              Microsoft Excel 2008 (Mac)
xls_comment@microsoft_excel.com           Microsoft Excel 2008 (Mac) (comment)
xlsx_cell@microsoft_excel.com             Microsoft Excel 2008 (Mac)
xlsx_comment@microsoft_excel.com          Microsoft Excel 2008 (Mac) (comment)

doc_within_doc@document.com               Microsoft Word 2007 (OLE .doc file within .doc)
docx_within_docx@document.com             Microsoft Word 2007 (OLE .docx file within .docx)
ppt_within_doc@document.com               Microsoft PowerPoint and Word 2007 (OLE .ppt file within .doc)
pptx_within_docx@document.com             Microsoft PowerPoint and Word 2007 (OLE .pptx file within .docx)
xls_within_doc@document.com               Microsoft Excel and Word 2007 (OLE .xls file within .doc)
xlsx_within_docx@document.com             Microsoft Excel and Word 2007 (OLE .xlsx file within .docx)

email_in_zip@zipfile1.com                 text file within ZIP
email_in_zip_zip@zipfile2.com             ZIP'ed text file, ZIP'ed
email_in_gzip@gzipfile.com                text file within GZIP
email_in_gzip_gzip@gzipfile.com           GZIP'ed text file, GZIP'ed

The image can be downloaded from nps-2010-emails
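To illustrate why the coding matters, here is a rough sketch of a byte-level address search. A plain ASCII pattern finds the UTF-8 address but misses the UTF-16 one, and neither pattern sees text inside a compressed container until the container is unpacked. Everything here is an assumption made for illustration: the names find_emails and peel_gzip are hypothetical, the address pattern is deliberately simplified, only little-endian UTF-16 is handled (big-endian data would need the mirrored pattern), and the PDF, ZIP, and OLE cases in the table are not covered.

```python
import gzip
import re

# Simplified address pattern for illustration; real address syntax is broader.
EMAIL_ASCII = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,}")

# The same pattern with a NUL byte after every character, which is how
# ASCII-range text appears when stored as little-endian UTF-16.
EMAIL_UTF16LE = re.compile(
    rb"(?:[A-Za-z0-9._%+-]\x00)+@\x00"
    rb"(?:[A-Za-z0-9_.-]\x00)+\.\x00(?:[A-Za-z]\x00){2,}"
)


def find_emails(data: bytes) -> set:
    """Return addresses visible in data as ASCII/UTF-8 or UTF-16LE text."""
    found = {m.group().decode("ascii") for m in EMAIL_ASCII.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in EMAIL_UTF16LE.finditer(data)}
    return found


def peel_gzip(data: bytes, max_layers: int = 4) -> bytes:
    """Strip up to max_layers of gzip compression (magic bytes 1f 8b)."""
    for _ in range(max_layers):
        if not data.startswith(b"\x1f\x8b"):
            break
        data = gzip.decompress(data)
    return data


if __name__ == "__main__":
    twice = gzip.compress(gzip.compress(b"email_in_gzip_gzip@gzipfile.com"))
    print(find_emails(peel_gzip(twice)))
    print(find_emails("plain_utf16@textedit.com".encode("utf-16-le")))
```

A tool being tested against the image would need equivalent handling for each row of the table; this sketch only shows the shape of the problem for two of the codings.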

Edit, 2011-11-26 19:32 PST: One email was incorrectly recorded above. xlsx_comment@microsoft_excel.com is within the disk image, but xlsx_cell_comment@microsoft_excel.com was recorded here. That is now corrected above.

Bots downloading disk images

I’m preparing some statistics on who (and what) is downloading the disk images we have here at digitalcorpora.org. The first thing that I’ve done is suppress the bots that are, for whatever reason, downloading the images.

Here are the bots that we’ve found, and the number of times each one has downloaded an image.

  Rank     Count     Value(s):
  ============================
      1      2334      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      2       851      MLBot (www.metadatalabs.com/mlbot)
      3       811      SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
      4       749      Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, crawler@dotnetdotcom.org)
      5       492      Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
      6       130      Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
      7       115      Mozilla/5.0 (compatible; DBLBot/1.0; +http://www.dontbuylists.com/)
      8       109      msnbot/2.0b (+http://search.msn.com/msnbot.htm)
      9       108      Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.sitebot.org/robot/)
     10        89      CCBot/1.0 (+http://www.commoncrawl.org/bot.html)
     11        87      Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
     12        78      TwengaBot-Discover (http://www.twenga.fr/bot-discover.html)
     13        58      Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)
     14        51      msnbot/1.1 (+http://search.msn.com/msnbot.htm)
     15        26      Mozilla/5.0 (compatible; MJ12bot/v1.3.2; http://www.majestic12.co.uk/bot.php?+)
     16        21      Cityreview Robot (+http://www.cityreview.org/crawler/)
     17        18      'citeseerxbot'
     18        15      SindiceBot (heritrix/2.0.2 +http://sindice.com/developers/bot)
     19        12      Mozilla/5.0 (compatible; MJ12bot/v1.3.1; http://www.majestic12.co.uk/bot.php?+)
     20        11      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html
     21         9      Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
     22         7      CatchBot/3.0; +http://www.catchbot.com
                7      CyberPatrol SiteCat Webbot (http://www.cyberpatrol.com/cyberpatrolcrawler.asp)
                7      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
     25         6      Mozilla/5.0 (compatible; Search17Bot/1.1; http://www.search17.com/bot.php
                6      yacybot (amd64 Linux 2.6.26-2-xen-amd64; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
     27         5      MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
                5      yacybot (amd64 Linux 2.6.31-20-generic; java 1.6.0_15; Europe/en) http://yacy.net/bot.html
                5      yacybot (i386 Linux 2.6.32-trunk-686; java 1.6.0_18; America/en) http://yacy.net/bot.html
     30         3      msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
     31         2      Mozilla/5.0 (SnapPreviewBot) Gecko/20061206 Firefox/1.5.0.9
                2      yacybot (amd64 Linux 2.6.26-2-amd64; java 1.6.0_20; Europe/en) http://yacy.net/bot.html
                2      yacybot (amd64 Linux 2.6.28-18-generic; java 1.6.0_19; GMT/en) http://yacy.net/bot.html
                2      yacybot (i386 Linux 2.6.31-21-generic; java 1.6.0_0; Europe/en) http://yacy.net/bot.html
     35         1      Mozilla/5.0 (compatible; Googlebot/2.1;  http://www.google.com/bot.html
                1      Mozilla/5.0 (compatible; discobot/1.1; +http://discoveryengine.com/discobot.html)
                1      findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
                1      librabot/1.0 (+http://search.msn.com/msnbot.htm)
                1      yacybot (amd64 Linux 2.6.18-164.11.1.el5xen; java 1.6.0; Europe/en) http://yacy.net/bot.html
                1      yacybot (amd64 Linux 2.6.18-164.15.1.el5; java 1.6.0_14; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_18; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86 Windows XP 5.1; java 1.6.0_20; Europe/de) http://yacy.net/bot.html
                1      yacybot (x86_64 Mac OS X 10.6.4; java 1.6.0_20; America/en) http://yacy.net/bot.html 

Total items printed: 6242
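
A tally like the one above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the script that produced the table: it assumes the user-agent strings have already been pulled out of the server log, it treats any agent containing "bot", "crawler", or "spider" as a bot (a heuristic invented here), and the names is_bot and rank_user_agents are hypothetical. Agents with equal counts share a rank, and the next distinct count takes its position in the list, which matches the 22, 25, 27 numbering above.

```python
from collections import Counter

# Assumed heuristic: substrings that mark a user agent as a bot.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(user_agent: str) -> bool:
    """Return True if the user-agent string looks like a crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def rank_user_agents(user_agents):
    """Count bot user agents and return (rank, count, agent) rows.

    Rows are sorted by descending count, then alphabetically by agent.
    Tied counts share the rank of the first agent in the tie.
    """
    counts = Counter(ua for ua in user_agents if is_bot(ua))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for position, (agent, count) in enumerate(ordered, start=1):
        if rows and rows[-1][1] == count:
            rank = rows[-1][0]  # same count as the previous row: share its rank
        else:
            rank = position
        rows.append((rank, count, agent))
    return rows


if __name__ == "__main__":
    sample = ["Googlebot/2.1"] * 3 + ["bingbot/2.0"] * 2 + ["Firefox/3.6"] * 4
    rows = rank_user_agents(sample)
    for rank, count, agent in rows:
        print(f"{rank:6d} {count:9d}      {agent}")
    print("Total items printed:", sum(count for _, count, _ in rows))
```

Substring matching will misclassify some agents in both directions, so the output is best treated as a first pass that still needs manual review.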