Big Data


The projects start on Tuesday 01.12.2015. Time allocated for the project is 3 sessions plus one session for project presentations. Students will work on a project in groups of at most three. All groups will present the results of their projects via a poster on Tuesday 05.01.2016. The poster will contain:

  1. The problem's importance and motivation for solving it

  2. The assumptions that you made, along with any problems that you were unable to solve

  3. The analysis of your results

  4. The evaluation of your results

  5. The conclusions (possible improvements or extensions of the project)

The project can be any topic related to Big Data Analytics using the MapReduce or Spark programming models.

Please send your project proposal and the names of your group members before the start of class on Tuesday 01.12.2015 to fatemeh.borran at

We encourage you to define your own subject of interest. The project description should specify the algorithms to be used, the dataset(s), and the evaluation method. Example project topics are listed below.

Dataset examples

Project topic examples

01 Recommending Music and the Audioscrobbler Data Set


The goal is to implement a music recommendation system using a dataset published by Audioscrobbler. Audioscrobbler was the first music recommendation system for Last.fm, one of the first Internet streaming radio sites, founded in 2002. The dataset records listeners’ plays of artists’ songs.


02 Predicting Forest Cover with Decision Trees


The goal of this project is to use cartographic variables to classify forest categories.

The data set records the types of forest covering parcels of land in Colorado, USA. Each example contains several features describing each parcel of land, like its elevation, slope, distance to water, shade, and soil type, along with the known forest type covering the land. The forest cover type is to be predicted from the rest of the features, of which there are 54 in total. The recommended algorithm is Decision Trees.
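A decision tree is built by repeatedly choosing the feature and threshold that best separate the classes. The sketch below shows how one candidate split could be scored. It assumes Gini impurity as the criterion (the project description does not prescribe one), and the parcels, feature name, and labels are hypothetical toy values, not taken from the real data set.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, feature, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold.

    Each row is a (features_dict, label) pair.
    """
    left = [label for feats, label in rows if feats[feature] <= threshold]
    right = [label for feats, label in rows if feats[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy parcels (illustration only)
parcels = [
    ({"elevation": 2100}, "aspen"),
    ({"elevation": 2300}, "aspen"),
    ({"elevation": 3100}, "spruce"),
    ({"elevation": 3300}, "spruce"),
]

before = gini([label for _, label in parcels])
after = split_impurity(parcels, "elevation", 2500)
print(before, after)
```

A lower impurity after the split means the threshold separates the forest types better. A full tree learner would try many features and thresholds and recurse on each side.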


03 Anomaly Detection in Network Traffic with K-means Clustering


The goal is to detect anomalous behavior in the network traffic of an organization. Anomalous behavior can point to things like intrusion attempts, denial-of-service attacks, port scanning, etc.

The dataset was generated for a data analysis competition from raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. The LAN was operated as if it were a true Air Force environment, but peppered with multiple attacks. Feature extraction has already been run on the data: the dataset contains a list of connections and, for each connection, 38 features such as the number of bytes sent, login attempts, and TCP errors.

As the data is not labeled, an unsupervised learning algorithm is applied, more specifically K-means clustering. The idea is to let the clustering discover normal behavior. Connections that fall outside of clusters are potentially anomalous.
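One way to make "falls outside of clusters" concrete is to score each connection by its distance to the nearest cluster centroid and flag the largest distances. The following is a minimal sketch of that idea in plain Python, assuming Euclidean distance, fixed initial centroids, and a hand-picked threshold. The two-feature connections are hypothetical toy values, and a real project would use the MapReduce or Spark implementation instead.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    i = min(range(len(centroids)), key=dists.__getitem__)
    return i, dists[i]

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm starting from the given initial centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        # New centroid = mean of its members; an empty cluster keeps its centroid.
        centroids = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids

# Hypothetical toy connections with two features each
connections = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    (9.0, 1.0),  # far away from both groups
]

centroids = kmeans(connections, [(1.0, 1.0), (5.0, 5.0)])
threshold = 3.0  # assumed cut-off; in practice chosen from the score distribution
anomalies = [p for p in connections if nearest(p, centroids)[1] > threshold]
print(anomalies)
```

Because the features have very different scales in real data, normalizing them before clustering is usually needed for the distances to be meaningful.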


04 Understanding Wikipedia with Latent Semantic Analysis


The goal is to apply Latent Semantic Analysis (LSA) to a corpus consisting of the full set of articles contained in Wikipedia, about 46 GB of raw text. LSA is a technique in natural language processing and information retrieval that seeks to better understand a corpus of documents and the relationships between the words in those documents. It uses a linear algebra technique called singular value decomposition (SVD).

The project makes it possible to find the important concepts in the collection, to calculate the relevance of documents to queries and terms, and to measure term-term and document-document relevance.
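Before the SVD can be applied, the corpus has to be turned into a weighted term-document matrix. The sketch below illustrates that preparatory step only. It assumes TF-IDF weighting with idf = log(N / df) and whitespace tokenization, neither of which is prescribed here, and the three documents are a hypothetical toy corpus. The SVD itself is left to a linear algebra library.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus
docs = [
    "spark processes big data",
    "hadoop processes big data",
    "cats chase mice",
]

weights = tf_idf(docs)
print(weights[0])
```

Terms that appear in few documents ("spark") receive a higher weight than terms shared by many ("big"), and a term present in every document would get weight zero.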


05 Analyzing Co-occurrence Networks with GraphX


The goal is to use a Spark library called GraphX to analyze a database of medical citations called the MEDLINE Database.

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a database of academic papers that have been published in journals covering the life sciences and medicine. Its citation index tracks the publication of articles across thousands of journals.

Due to the volume of citations and the frequency of updates, the research community has developed an extensive set of semantic tags called MeSH (Medical Subject Headings) that are applied to all of the citations in the index. These tags provide a meaningful framework that can be used to explore relationships between documents.

The goal here is to get a feel for the shape and properties of the citation graph. This can be done in a few different ways. One is to look at the major topics and their co-occurrences, a simpler analysis that doesn’t require GraphX. Another is to look for connected components: can one follow a path of citations from any topic to any other topic, or is the data actually a set of separate smaller graphs?
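Both analyses can be tried on a small sample before moving to Spark. The sketch below counts topic co-occurrences and then finds connected components with a union-find structure. It is a single-machine illustration, not the GraphX API, and the citations and topic names are hypothetical toy values.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(citations):
    """Count how often each unordered pair of topics tags the same citation."""
    counts = Counter()
    for topics in citations:
        for pair in combinations(sorted(set(topics)), 2):
            counts[pair] += 1
    return counts

def connected_components(edges):
    """Group vertices into connected components using union-find."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

# Hypothetical toy citations, each with its list of topic tags
citations = [
    ["Asthma", "Child", "Humans"],
    ["Asthma", "Humans"],
    ["Neoplasms", "Mice"],
]

pairs = cooccurrences(citations)
components = connected_components(pairs.keys())
print(pairs.most_common(1), components)
```

Here the co-occurring pairs serve as the edges of the topic graph, so two topics end up in the same component when a chain of shared citations links them.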


06 Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data


The goal is to calculate the utilization of New York City taxi cabs: the fraction of time that a cab is on the road, occupied by one or more passengers, and generating income for the driver. The utilization varies by borough (Manhattan, Brooklyn, Queens, ...) and is thus a function of the passenger's destination.

The dataset comes from the City of New York, which tracks the GPS position of each taxi. It contains a list of rides. For each ride there is information about the taxi and the driver, the times the trip started and ended, and the GPS coordinates where the passenger(s) were picked up and dropped off.

The analysis uses specialized libraries for doing arithmetic with temporal data and for mapping GPS coordinates to boroughs. A sessionization algorithm makes it possible to identify the behavior of an individual taxi driver throughout his or her day.
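Sessionization and utilization can be illustrated for a single driver as follows. This is a minimal sketch under stated assumptions: a new shift starts when the gap between a drop-off and the next pickup exceeds four hours (an arbitrary threshold chosen for illustration), utilization is measured from the first pickup to the last drop-off of a shift, and the timestamps are hypothetical. The borough lookup is left out.

```python
from datetime import datetime, timedelta

def sessionize(trips, max_gap=timedelta(hours=4)):
    """Split one driver's trips into shifts.

    Each trip is a (pickup, dropoff) pair of datetimes. A new shift starts
    when the gap since the previous drop-off exceeds max_gap.
    """
    sessions = []
    for trip in sorted(trips):
        if sessions and trip[0] - sessions[-1][-1][1] <= max_gap:
            sessions[-1].append(trip)
        else:
            sessions.append([trip])
    return sessions

def utilization(session):
    """Occupied time divided by the time from first pickup to last drop-off."""
    occupied = sum((drop - pick for pick, drop in session), timedelta())
    total = session[-1][1] - session[0][0]
    return occupied / total

# Hypothetical trips of one driver on one day
day = datetime(2013, 1, 1)
trips = [
    (day.replace(hour=8, minute=0), day.replace(hour=8, minute=20)),
    (day.replace(hour=8, minute=40), day.replace(hour=9, minute=0)),
    (day.replace(hour=18, minute=0), day.replace(hour=18, minute=30)),
]

sessions = sessionize(trips)
rates = [utilization(s) for s in sessions]
print(len(sessions), rates)
```

In the morning shift the cab is occupied for 40 of 60 minutes. Grouping the idle gaps by the borough of the preceding drop-off would give the per-borough view described above.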



The examples were chosen from the book

Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
Advanced Analytics with Spark
April 2015