Big Data

Preparation for using Spark

First, download Spark (the binary version pre-built for Hadoop) from here. Then extract the archive, change into the Spark directory, and run

./bin/run-example SparkPi 10

It should compute an approximation of Pi and finish without any error messages.

Requirements

Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.5.1 uses Scala 2.10, so you will need a compatible Scala version (2.10.x). You should also install sbt to compile and package a Spark project.

Installing sbt

You can install sbt on your machine depending on your operating system by following the instructions given at scala-sbt website.
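Once sbt is installed, you can verify that your toolchain meets the requirements above with commands along these lines (a sketch; the exact output depends on your system):

$ java -version
$ scala -version
$ sbt sbtVersion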

Interactive Analysis with the Spark Shell

We use the Spark shell to learn the framework. Run the following command from the Spark home directory:

./bin/spark-shell

We are going to process Spark's README.md file. Run the following commands in the Spark shell, explain each line, and show the results:

scala> val textFile = sc.textFile("README.md")

scala> textFile.count

scala> textFile.first

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

scala> textFile.filter(line => line.contains("Spark")).count
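Transformations and actions can also be chained. As an optional extra (a sketch that reuses the textFile RDD created above), the following spark-shell commands find the largest number of words on a single line and then count how often each word occurs in README.md:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

scala> wordCounts.collect()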

Self-Contained Applications

Now we create a very simple Spark application in Scala.

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
    def main(args: Array[String]) {
        val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
        // Run locally with two worker threads
        val conf = new SparkConf().setMaster("local[2]").setAppName("Simple Application")
        val sc = new SparkContext(conf)
        // Load the file as an RDD with two partitions and cache it in memory
        val logData = sc.textFile(logFile, 2).cache()
        // Count the lines containing the letter "a" and the letter "b", respectively
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    }
}

Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
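For example, the following pattern (SimpleAppViaApp is just an illustrative name) is the one this note warns against, even though it is valid Scala:

/* Discouraged for Spark: extending scala.App instead of defining main() */
object SimpleAppViaApp extends App {
    println("This style may not work correctly with Spark")
}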

Question: What does this program do?

Our application depends on the Spark API, so we'll also include an sbt configuration file, build.sbt, which declares Spark as a dependency:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"

For sbt to work correctly, we'll need to lay out SimpleApp.scala and build.sbt according to the typical directory structure shown below. Once that is in place, we can create a JAR package containing the application's code and then use the spark-submit script to run our program.
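Assuming you start from a project directory that contains only SimpleApp.scala and build.sbt, you might create this layout with commands along these lines (a sketch; adjust paths to your setup):

$ mkdir -p src/main/scala
$ mv SimpleApp.scala src/main/scala/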

# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Compile the project and package a jar containing your application
$ sbt clean compile package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...

Question: What is the output of the program?

Creating an Eclipse Project

You can create an Eclipse project using sbt. To do so, you need to add the Eclipse plugin for sbt: create a file called plugins.sbt in the project folder with the following content:

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")

Then run the following command:

$ sbt eclipse

Now you can import your project into Eclipse using Import and then selecting Existing Projects into Workspace. You can run your Spark application like any Scala application through Eclipse.

Reference: You can find Spark documentation on the Spark website.