Website:
http://spark.apache.org/
Developer(s):
Apache / Open Source
System(s):
Topaz
License:
Apache License, version 2.0 (a permissive open-source license)

Apache Spark is a fast, general-purpose engine for large-scale data processing. It is an open-source framework that supports applications written in Java, Scala, Python, and R, and any Spark application can be run interactively through the chosen language's shell. Spark provides libraries for SQL and DataFrames (Spark SQL), machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming), all of which can be combined within the same application.
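
As a quick illustration, here is a minimal sketch of what a Python Spark application looks like using the RDD API available in the default spark-1.6.2 installation. The script name (word_count.py) and the input path are hypothetical placeholders:

#!/usr/bin/env python
# word_count.py - minimal Spark example (hypothetical file name and input path)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("word-count")
sc = SparkContext(conf=conf)

# read a text file, split each line into words, and count occurrences of each word
counts = (sc.textFile("file:///path/to/sample.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

# print the first ten (word, count) pairs
for word, n in counts.take(10):
    print("%s: %d" % (word, n))

sc.stop()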

Installation

Apache Spark is installed on supported HPC systems in $DAAC_HOME/spark/[spark-VERSION]. The most current version is set as the default for all users. (The current default version is spark-1.6.2.)
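
To see which versions are installed on a particular system, you can list the installation directory or query the module system (a sketch):

% ls $DAAC_HOME/spark
% module avail spark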

Use the following module commands in your PBS script or interactive session to load the necessary environment variables:

% module load spark
% module load java

(Note: If using R, remember to execute "module load R" first to set the required environment variables.)
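
For example, to work with Spark interactively on a compute node, you can request an interactive PBS session, load the modules, and start the shell for your chosen language. This is a sketch; the resource selections are illustrative, and the shells are invoked from ${SPARK_HOME}/bin in case the spark module does not place them on your PATH:

% qsub -I -l select=1:ncpus=36:mpiprocs=36,walltime=00:30:00 -A [YOUR_PROJECT_HERE] -q debug
% module load java
% module load spark
% ${SPARK_HOME}/bin/pyspark

(Use spark-shell instead of pyspark for Scala, or sparkR for R after "module load R".)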

Example Scripts

Apache Spark comes with example scripts in all supported languages. The example code is located in $DAAC_HOME/spark/examples/src/main.

Note: These scripts show the basic usage of Apache Spark. The setup needed to run across multiple nodes is not included in this directory; the information needed to run on multiple nodes on Topaz is covered in the Multiple Nodes section below.

There is also a copy of a GitHub repository of data analysis examples in: $DAAC_HOME/spark/data_analysis_examples

Execute "download_data.sh" to copy the source data to your $WORKDIR.

For more information, consult the README.md file in that directory.
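
Before attempting a multi-node job, you can test one of the bundled examples on a single node by running spark-submit in local mode (a sketch; local[4] simply requests four worker threads on the local machine):

% module load java
% module load spark
% ${SPARK_HOME}/bin/spark-submit --master local[4] ${SPARK_HOME}/examples/src/main/python/pi.py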

Multiple Nodes

To run Spark on multiple nodes, your PBS script must include a code snippet to set up the environment. This code performs the following functions:

  1. Creates a list of the nodes allocated to your PBS job.
  2. Starts a master script on the "head" node.
  3. Starts a worker script on each remaining node in the list; each worker connects back to the master.

Note: The steps outlined above (implemented in the PBS snippet below) provided a successful Spark setup, as tested by the DAAC. This is a straightforward solution; if you find a more efficient way to set up multiple nodes, please let us know. We would be interested in evaluating alternate solutions.

[The code below demonstrates how to set up a multiple-node job on Topaz and launch an example program, pi.py, which computes the value of pi.]

#!/bin/bash
#
#PBS -l select=3:ncpus=36:mpiprocs=36,walltime=00:10:00
#PBS -N spark-test
#PBS -A [YOUR_PROJECT_HERE]
#PBS -q debug

nodes=($( cat $PBS_NODEFILE | sort | uniq ))
nnodes=${#nodes[@]}
last=$(( $nnodes - 1 ))

source $MODULESHOME/init/bash

module load java
module load spark

#creates cat_master_test.sh to launch master node
cat > $WORKDIR/cat_master_test.sh <<EOT
#!/bin/bash

source $MODULESHOME/init/bash

module load java
module load spark

cd ${SPARK_HOME}
./sbin/start-master.sh

EOT

#change permissions to executable
chmod +x $WORKDIR/cat_master_test.sh

#check the user's login shell; csh users must set a prompt and source the HPCMP profile before running the launch script
if [ "$SHELL" = '/bin/csh' ]; then
        ssh ${nodes[0]} 'set prompt="hello>"; source /etc/profile.d/zzz-hpcmp.csh; $WORKDIR/cat_master_test.sh'
else
        ssh ${nodes[0]} '$WORKDIR/cat_master_test.sh'
fi

#creates cat_worker_test.sh to launch worker nodes and connect back to master
cat > $WORKDIR/cat_worker_test.sh <<EOT
#!/bin/bash

source $MODULESHOME/init/bash

module load java
module load spark

cd ${SPARK_HOME}
./sbin/start-slave.sh spark://${nodes[0]}:7077

EOT

#change permissions to executable
chmod +x $WORKDIR/cat_worker_test.sh

#for each remaining node in the list, execute startup for worker
for i in $(seq 1 $last)
    do
        if [ "$SHELL" = '/bin/csh' ]; then
                ssh ${nodes[$i]} 'set prompt="hello>"; source /etc/profile.d/zzz-hpcmp.csh; $WORKDIR/cat_worker_test.sh'
        else
                ssh ${nodes[$i]} '$WORKDIR/cat_worker_test.sh'
        fi
    done

#cleanup
rm $WORKDIR/cat_master_test.sh
rm $WORKDIR/cat_worker_test.sh

#execute test program from Apache, pi.py
${SPARK_HOME}/bin/spark-submit --master spark://${nodes[0]}:7077 ${SPARK_HOME}/examples/src/main/python/pi.py
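
Assuming the script above is saved as spark-test.pbs (a hypothetical file name), submit it with qsub; when the job finishes, the value of pi computed by pi.py appears in the job's standard output file:

% qsub spark-test.pbs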

Help

For full documentation, programming guides, and more examples, please consult Spark's official website: http://spark.apache.org/

For other support needs related to Spark's installation on HPCMP systems, please contact support@daac.hpc.mil.

External Links