Big Data

Datasets for projects

Dataset collections

Reddit submission corpus

All Reddit submissions (submissions only, no discussion threads) from 2006 through August 2015

40 GB compressed

Reddit top 2.5 million

This is a dataset of top posts from Reddit. It contains the top 1,000 all-time posts from each of the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count and are listed in the manifest file included in the dataset.

This data was pulled between August 15 and 20, 2013.

Stack Exchange (= Stack Overflow, Server Fault, Super User, ...)

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks.
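Because the archives are XML, a file such as Posts can be processed as a stream rather than loaded into memory at once. The following is a minimal sketch, not the official way to read the dump. It assumes each record is a `<row>` element whose fields are stored as attributes, and that posts carry a `PostTypeId` attribute; both are assumptions that should be checked against the actual files. The sample data is made up.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_post_types(xml_source):
    """Stream over <row> elements and count the values of PostTypeId.

    Assumption: each record is a <row .../> element with its fields
    stored as XML attributes (verify against the real dump).
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            counts[elem.attrib.get("PostTypeId", "unknown")] += 1
        # Drop the element's attributes and children once it has been read.
        elem.clear()
    return counts


# Tiny made-up sample standing in for an extracted Posts file.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" />'
    b'<row Id="2" PostTypeId="2" />'
    b'<row Id="3" PostTypeId="2" />'
    b'</posts>'
)
print(count_post_types(sample))
```

For a real archive, the file would first be extracted with 7-zip and then opened in binary mode and passed to `count_post_types`.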

World-Wide Web

Common Crawl is a nonprofit organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of 145 TB of data from 1.81 billion webpages as of August 2015. It completes four crawls a year. (Source: Wikipedia)

Wikipedia

Wikipedia offers free copies of all available content to interested users.

Audioscrobbler dataset

Audioscrobbler, which has since merged with Last.fm, once published a database of what music people listened to with the Audioscrobbler plugin. Last.fm no longer publishes it; however, the initial releases were in the public domain, so I can offer it for download.

135 MB compressed, 500 MB uncompressed

Million song dataset

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital.

The metadata of a track includes information such as: artist, duration, release (album name), title, year.

The features derived from the audio track include: bars, beats, danceability, energy, key, loudness, sections, segments, song_hotttnesss, tatums, tempo, time_signature.

Covertype dataset

Predicting forest cover type from cartographic variables only (no remotely sensed data)

NCDC Weather data

Worldwide surface weather observations from over 20,000 stations. Hourly measurements. Parameters included are: air quality, atmospheric pressure, atmospheric temperature/dew point, atmospheric winds, clouds, precipitation, ocean waves, tides and more.

For example, the data for the year 2012:

Directory ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/2012
File 010010-99999-2012.gz
File 010014-99999-2012.gz
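The file names above appear to follow a fixed pattern: two station identifiers and the year, separated by hyphens. The sketch below splits such a name and rebuilds the download location from the directory shown above. Interpreting the two identifiers as USAF and WBAN station numbers is an assumption, and the helper names are hypothetical.

```python
from typing import NamedTuple

# Base directory taken from the listing above; the year is appended per file.
BASE_URL = "ftp://ftp3.ncdc.noaa.gov/pub/data/noaa"


class StationFile(NamedTuple):
    usaf: str   # first identifier (assumed to be the USAF station number)
    wban: str   # second identifier (assumed to be the WBAN number)
    year: int


def parse_station_filename(name: str) -> StationFile:
    """Split a name like '010010-99999-2012.gz' into its three parts."""
    stem = name[:-3] if name.endswith(".gz") else name
    usaf, wban, year = stem.split("-")
    return StationFile(usaf, wban, int(year))


def file_url(f: StationFile) -> str:
    """Rebuild the location of a station file under its year directory."""
    return f"{BASE_URL}/{f.year}/{f.usaf}-{f.wban}-{f.year}.gz"


for name in ["010010-99999-2012.gz", "010014-99999-2012.gz"]:
    parsed = parse_station_filename(name)
    print(parsed, file_url(parsed))
```

The identifiers are kept as strings so that leading zeros are preserved.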

Network traffic anomaly detection

Raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. The LAN was operated as if it were a true Air Force environment, but peppered with multiple attacks.

This dataset was used for the KDD-99 competition

Enron email dataset

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation. The data was originally collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by a litigation support and data analysis contractor, in order to preserve the vast amounts of data in the wake of the Enron bankruptcy in December 2001.

Medline citation index

New York City taxi data

Data on taxi trips in NYC (GPS data)

List of trips. For each trip: taxi identification, driver identification, start time / end time of trip, GPS coordinates of pick-up and drop-off, fare.
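The per-trip fields listed above map naturally onto a small record type. The following is a rough sketch only: the field names, types, and sample values are hypothetical and do not reflect the actual column names or formats of the data files.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class Trip:
    # All field names here are illustrative, not the dataset's real columns.
    taxi_id: str
    driver_id: str
    start_time: datetime
    end_time: datetime
    pickup: Tuple[float, float]    # (latitude, longitude)
    dropoff: Tuple[float, float]   # (latitude, longitude)
    fare: float

    @property
    def duration_minutes(self) -> float:
        """Trip length in minutes, derived from start and end time."""
        return (self.end_time - self.start_time).total_seconds() / 60.0


# Made-up example record.
trip = Trip(
    taxi_id="T1",
    driver_id="D1",
    start_time=datetime(2013, 1, 1, 8, 0),
    end_time=datetime(2013, 1, 1, 8, 15),
    pickup=(40.75, -73.99),
    dropoff=(40.76, -73.97),
    fare=12.0,
)
print(trip.duration_minutes, trip.fare / trip.duration_minutes)
```

Derived quantities such as duration or fare per minute are a common starting point for project ideas on this dataset.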