Preparation for using Spark
First, download Spark (the binary version pre-built for Hadoop) from here.
Then extract the archive, change into the Spark directory, and run:
./bin/run-example SparkPi 10
It should compute an approximation of Pi without any error messages.
Requirements
Spark runs on Java 7+, Python 2.6+ and R 3.1+.
For the Scala API, Spark 1.5.1 uses Scala 2.10.
You will need to use a compatible Scala version (2.10.x).
You should also install sbt to compile and package a Spark project.
Installing sbt
You can install sbt on your machine by following the operating-system-specific instructions on the scala-sbt website.
- On Mac, we suggest using MacPorts.
- On Linux, just follow the instructions given on the scala-sbt website.
- On Windows, you can also use Babun to install sbt.
Interactive Analysis with the Spark Shell
We use the Spark shell to learn the framework. Run the following command from the Spark home directory:
./bin/spark-shell
We are going to process Spark's README.md file.
Run the following commands in the Spark shell; explain each line and show the results:
scala> val textFile = sc.textFile("README.md")
scala> textFile.count
scala> textFile.first
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> textFile.filter(line => line.contains("Spark")).count
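As an optional further illustration of the same RDD API (not required by the exercise; the variable names are our own), the classic word count over the same file can be expressed with flatMap, map, and reduceByKey:

```scala
scala> val wordCounts = textFile.flatMap(line => line.split(" "))
                                .map(word => (word, 1))
                                .reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
```

Here flatMap splits each line into words, map pairs each word with a count of 1, and reduceByKey sums the counts per word; collect() brings the result back to the driver.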
Self-Contained Applications
Now we create a very simple Spark application in Scala.
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val conf = new SparkConf().setMaster("local[2]").setAppName("Simple Application") // a master set here overrides the --master flag of spark-submit
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
Note that applications should define a main() method instead of extending scala.App; subclasses of scala.App may not work correctly.
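To illustrate why, here is a minimal sketch (our own example) of the discouraged pattern:

```scala
// Discouraged: fields of a scala.App object are initialized via
// delayedInit, not in a constructor, so closures serialized and
// shipped to executors may observe them as null or 0.
object BadApp extends App {
  val greeting = "hello"            // initialized lazily by delayedInit
  // someRdd.map(x => greeting + x) // may see greeting == null on executors
}
```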
Question: What does this program do?
Our application depends on the Spark API, so we'll also include an sbt configuration file, build.sbt, which declares Spark as a dependency (the %% operator appends the Scala version to the artifact name):
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
For sbt to work correctly, we'll need to lay out SimpleApp.scala and build.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code, then use the spark-submit script to run our program.
# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Compile the project and package a jar containing your application
$ sbt clean compile package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.10/simple-project_2.10-1.0.jar
...
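If you are starting from scratch, the layout above can be created with a few shell commands (the project directory name simple-project is our own choice):

```shell
# Create the standard sbt source layout
mkdir -p simple-project/src/main/scala
cd simple-project
# Placeholder files; fill them with the build.sbt and SimpleApp.scala shown above
touch build.sbt src/main/scala/SimpleApp.scala
find . -type f
```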
Question: What is the output of the program?
Creating Eclipse Project
You can create an Eclipse project using sbt.
To do so, you need the Eclipse plugin for sbt. Add a file called plugins.sbt to the project folder with the following content:
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
Then run the following command:
$ sbt eclipse
Now you can import your project into Eclipse using Import and then selecting Existing Projects into Workspace. You can run your Spark application like any other Scala application through Eclipse.
Reference: You can find Spark documentation on the Spark website.