Run SparkPi on SJTU Pi Supercomputer

Install Spark 2.1 to $HOME

Download and extract Spark 2.1 to your $HOME directory.

$ export http_proxy=http://proxy.hpc.sjtu.edu.cn:3004
$ export https_proxy=https://proxy.hpc.sjtu.edu.cn:3004
$ cd
$ wget https://d3kbcqa49mib13.cloudfront.net/spark-2.1.1-bin-hadoop2.7.tgz
$ tar -xzvpf spark-2.1.1-bin-hadoop2.7.tgz
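
Optionally, you can set SPARK_HOME and put Spark's scripts on your PATH so later commands are shorter. This is only a convenience sketch assuming the default extraction location above; the rest of this tutorial keeps the full relative paths.

$ echo 'export SPARK_HOME=$HOME/spark-2.1.1-bin-hadoop2.7' >> ~/.bashrc
$ echo 'export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH' >> ~/.bashrc
$ source ~/.bashrc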

Request resources from the SLURM scheduling system

Request 2 or more exclusive nodes from SLURM:

$ srun -p cpu -N 2 --exclusive --pty /bin/bash

Now you are in a shell environment controlled by SLURM. You can check the resources in use with:

$ scontrol show hostnames $SLURM_JOB_NODELIST
gpu35
gpu36
$ squeue -u `whoami`

Please keep this session open and start a new terminal window for the rest of the operations.
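
If you need more detail about the allocation (node list, core counts, time limit), the standard SLURM command below can be run from within the srun shell; SLURM_JOB_ID is set automatically inside the allocation.

$ scontrol show job $SLURM_JOB_ID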

Start Spark services

Start the Spark master on the first node of your requested resources. In this tutorial, the server for the Spark master is gpu35.

mu05$ ssh gpu35
gpu35$ spark-2.1.1-bin-hadoop2.7/sbin/start-master.sh

Please confirm the Spark master has started properly by checking the log. The URL spark://xxx:7077 will be used for worker registration later.

gpu35$ tail /lustre/home/acct-hpc/hpclc/spark-2.1.1-bin-hadoop2.7/logs/spark-hpclc-org.apache.spark.deploy.master.Master-1-gpu35.out
17/06/04 16:35:59 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/06/04 16:35:59 INFO Master: Starting Spark master at spark://gpu35:7077
17/06/04 16:35:59 INFO Master: Running Spark version 2.1.1
17/06/04 16:35:59 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/06/04 16:35:59 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://180.0.6.35:8080
17/06/04 16:35:59 INFO Utils: Successfully started service on port 6066.
17/06/04 16:35:59 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/06/04 16:35:59 INFO Master: I have been elected leader! New state: ALIVE 
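
Besides the log, you can also verify that the master web UI (port 8080 in the log above) answers from the login node. This is only a quick sketch assuming curl is available; it prints the HTTP status code (200 when the MasterWebUI is up).

mu05$ curl -s -o /dev/null -w "%{http_code}\n" http://gpu35:8080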

Then start the slave service on all other nodes so that the Spark workers register with the master. There is only one slave node, gpu36, in this tutorial; repeat this step on each additional slave node (or use the loop sketch after the log check below).

mu05$ ssh gpu36
gpu36$ spark-2.1.1-bin-hadoop2.7/sbin/start-slave.sh spark://gpu35:7077

Confirm all workers have registered successfully by checking the master log.

gpu35$ tail /lustre/home/acct-hpc/hpclc/spark-2.1.1-bin-hadoop2.7/logs/spark-hpclc-org.apache.spark.deploy.master.Master-1-gpu35.out
17/06/04 16:41:52 INFO Master: Registering worker 180.0.6.36:38869 with 16 cores, 61.8 GB RAM
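
If your allocation contains more than two nodes, starting each worker by hand is tedious. The loop below is only a sketch: it assumes it is run inside the srun shell (where SLURM_JOB_NODELIST is set), that the master runs on the first node of the allocation, and that passwordless SSH between compute nodes is available.

$ MASTER=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
$ for node in $(scontrol show hostnames $SLURM_JOB_NODELIST | tail -n +2); do
>     ssh $node "~/spark-2.1.1-bin-hadoop2.7/sbin/start-slave.sh spark://$MASTER:7077"
> done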

Submit a SparkPi job to this Spark cluster

Now you can submit a SparkPi job from any node on the Pi supercomputer. This command specifies the master (spark://gpu35:7077), the executor memory (48 GB), the total number of executor cores (16) and the number of job slices (1000). This job usually takes less than one minute to finish.

mu05$ spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://gpu35:7077 \
--executor-memory 48G \
--total-executor-cores 16 \
~/spark-2.1.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.1.jar \
1000
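
The bundled SparkPi example prints its estimate of pi among the driver output. If you only want to see the result, a sketch like the following appends a filter to the same command ("Pi is roughly" is the string the example prints):

mu05$ spark-2.1.1-bin-hadoop2.7/bin/spark-submit --class org.apache.spark.examples.SparkPi \
          --master spark://gpu35:7077 --executor-memory 48G --total-executor-cores 16 \
          ~/spark-2.1.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.1.jar 1000 \
          2>&1 | grep "Pi is roughly"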

Release resources after job completion

Stop the Spark slave service on each slave node.

gpu36$ spark-2.1.1-bin-hadoop2.7/sbin/stop-slave.sh

Stop the Spark master.

gpu35$ spark-2.1.1-bin-hadoop2.7/sbin/stop-master.sh

Terminate all SSH sessions and exit the srun shell so that SLURM releases the resources requested by srun.
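
If the srun shell has already been closed, or you prefer to release the allocation explicitly, you can cancel the job by ID. This is a sketch using standard SLURM commands; <JOBID> is a placeholder for the job ID shown by squeue.

$ squeue -u `whoami`
$ scancel <JOBID>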
