Download and extract Spark 2.1 to your $HOME directory.
$ cd
$ wget https://d3kbcqa49mib13.cloudfront.net/spark-2.1.1-bin-hadoop2.7.tgz
$ tar -xzvpf spark-2.1.1-bin-hadoop2.7.tgz
Request 2 or more exclusive nodes from SLURM:
$ srun -p cpu -N 2 --exclusive /bin/bash
Now you are in a shell environment controlled by SLURM. You can check the resources in use with hostname, which here prints the allocated nodes (gpu35 and gpu36), and squeue:
hostname
gpu35
gpu36
squeue -u `whoami`
Please keep this session open and start a new terminal window for the rest of the steps.
Start the Spark master on the first node of your requested resources. In this example, the Spark master runs on gpu35.
mu05$ ssh gpu35
gpu35$ spark-2.1.1-bin-hadoop2.7/sbin/start-master.sh
Confirm that the Spark master started properly by checking its log.
The spark://xxx:7077 URL reported in the log will be used for worker registration later.
gpu35$ tail /lustre/home/acct-hpc/hpclc/spark-2.1.1-bin-hadoop2.7/logs/spark-hpclc-org.apache.spark.deploy.master.Master-1-gpu35.out
17/06/04 16:35:59 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/06/04 16:35:59 INFO Master: Starting Spark master at spark://gpu35:7077
17/06/04 16:35:59 INFO Master: Running Spark version 2.1.1
17/06/04 16:35:59 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/06/04 16:35:59 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://184.108.40.206:8080
17/06/04 16:35:59 INFO Utils: Successfully started service on port 6066.
17/06/04 16:35:59 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/06/04 16:35:59 INFO Master: I have been elected leader! New state: ALIVE
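If you would rather not read the master URL out of the log by eye, a small filter can extract it. This is a minimal sketch, assuming the log contains a "Starting Spark master at" line like the one above; master_url_from_log is a hypothetical helper name and is not part of Spark.

```shell
# Hypothetical helper (not part of Spark): read a master log on stdin and
# print the first spark:// URL found in it.
master_url_from_log() {
  grep -o 'spark://[^ ]*' | head -n 1
}

# Example using the log line shown above.
echo "17/06/04 16:35:59 INFO Master: Starting Spark master at spark://gpu35:7077" \
  | master_url_from_log
```

On the master node, the same function can be fed the real log file on standard input instead of the sample line.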
Then start the slave service on every other node. Once started, each Spark worker registers itself with the master. This tutorial has only one slave node, gpu36; repeat this step for any additional slave nodes.
mu05$ ssh gpu36
gpu36$ spark-2.1.1-bin-hadoop2.7/sbin/start-slave.sh spark://gpu35:7077
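With more than one slave node, repeating this step by hand gets tedious. The sketch below only prints the ssh command for each node (a dry run), so you can review the commands before running them. print_worker_start_cmds is a hypothetical helper name, and the sketch assumes Spark is extracted to the same path under $HOME on every node, as in this tutorial.

```shell
# Hypothetical helper: print one ssh command per worker node.
# Nothing is executed; this is a dry run for review.
# Usage: print_worker_start_cmds MASTER_URL NODE [NODE...]
print_worker_start_cmds() {
  master_url=$1
  shift
  for node in "$@"; do
    printf 'ssh %s spark-2.1.1-bin-hadoop2.7/sbin/start-slave.sh %s\n' \
      "$node" "$master_url"
  done
}

# Example for this tutorial's single slave node.
print_worker_start_cmds spark://gpu35:7077 gpu36
```

Once the printed commands look right, they can be run one by one or piped to sh.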
Confirm that all workers have been added successfully by checking the master log.
gpu35$ tail /lustre/home/acct-hpc/hpclc/spark-2.1.1-bin-hadoop2.7/logs/spark-hpclc-org.apache.spark.deploy.master.Master-1-gpu35.out
17/06/04 16:41:52 INFO Master: Registering worker 220.127.116.11:38869 with 16 cores, 61.8 GB RAM
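When several workers are expected, counting the registration lines is quicker than scanning the log. This is a minimal sketch, assuming each worker produces one "Registering worker" line in the master log as shown above; count_registered_workers is a hypothetical helper name.

```shell
# Hypothetical helper: read a master log on stdin and print how many
# "Registering worker" lines it contains.
count_registered_workers() {
  grep -c 'Master: Registering worker'
}

# Example using the log line shown above.
echo "17/06/04 16:41:52 INFO Master: Registering worker 220.127.116.11:38869 with 16 cores, 61.8 GB RAM" \
  | count_registered_workers
```

The count should match the number of slave nodes you started. A worker that was restarted may appear more than once, so treat the number as a quick check rather than an exact inventory.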
Now you can submit a SparkPi job from any node on the Pi supercomputer.
This command specifies the master node (spark://gpu35:7077), the executor memory limit (48 GB), the total number of executor cores (16), and the number of job slices (1000).
This job usually takes less than 1 minute to finish.
mu05$ spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://gpu35:7077 \
    --executor-memory 48G \
    --total-executor-cores 16 \
    ~/spark-2.1.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.1.jar \
    1000
Stop the Spark slave service on each slave node.
Stop Spark master.
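No commands are listed for the two stop steps. As a hedged sketch: the Spark 2.1.1 distribution ships sbin/stop-slave.sh and sbin/stop-master.sh next to the start scripts used earlier, so the shutdown could look like the commands printed below. print_stop_cmds is a hypothetical helper that only prints the commands (a dry run), slaves first and then the master. It assumes the same install path on every node.

```shell
# Hypothetical helper: print the shutdown commands in order,
# slave nodes first, then the master. Nothing is executed.
# Usage: print_stop_cmds MASTER_NODE SLAVE_NODE [SLAVE_NODE...]
print_stop_cmds() {
  master=$1
  shift
  for node in "$@"; do
    printf 'ssh %s spark-2.1.1-bin-hadoop2.7/sbin/stop-slave.sh\n' "$node"
  done
  printf 'ssh %s spark-2.1.1-bin-hadoop2.7/sbin/stop-master.sh\n' "$master"
}

# Example for this tutorial's nodes.
print_stop_cmds gpu35 gpu36
```

Stopping the slaves before the master avoids workers logging connection errors against a master that has already exited.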
Terminate all SSH sessions so SLURM will terminate the resource requested by