Setting Up Apache Spark in Standalone Mode

A thorough approach to setting up a parallel computation cluster

Rahul Dubey
Towards Data Science

--

In this post we are going to set up Apache Spark in standalone cluster mode. Apache Spark is a framework used to solve Big Data problems such as data processing, feature engineering, machine learning and working with streaming data. It lets you distribute your workload across several worker nodes using the Standalone, YARN or Mesos cluster manager for parallel computation. The simplest of these is the Standalone cluster manager, which doesn’t require much tinkering with configuration files to set up your own processing cluster.

I’ll not cover conceptual details in this post but rather practical ones. For the multi-node cluster I’ll use two nodes: one machine with Ubuntu installed natively, and another running Ubuntu in a VM with the Bridged Adapter enabled. If the Bridged Adapter is not enabled, follow the steps below.

Setting up the VM for host machine IP address sharing

1. Select the machine and go to Settings (image by author)
2. Switch to the Network tab and select Adapter 1. Check “Enable Network Adapter” if it is unchecked, then select “Bridged Adapter” from the drop-down box. (image by author)

To check whether the IP address is being shared with the VM, log in to the VM and type the following in a terminal:

#If "ifconfig" not installed
$ sudo apt-get install net-tools
$ ifconfig

You should see an IP address on the same network (subnet) as your host machine.
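
If you want to confirm that the VM and the host can actually reach each other, a quick ping from the VM is enough; the address below is just an example, so replace it with your host machine’s IP.

#Ping the host machine from the VM (replace with your host's actual IP)
$ ping -c 3 192.168.1.10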

Setting up Apache Spark Environment

Once the basic network configuration is done, we need to set up the Apache Spark environment: install the binaries and dependencies, and add the Apache Spark directory as well as the Python directory to the system path, so that we can run the shell scripts provided in Spark’s bin directory to start the cluster.

-----PYTHON SETUP-----
#Check if "python3" is installed
$ python3 --version
#If "pyhon3" not installed, then type commands below
#This will install python 3.6.9
$ sudo apt-get install python3.6
#Check if pip is installed, since we might need it in future
$ pip3 --version
#If pip3 not installed, then type commands below
$ sudo apt install python3-pip
#Also install some other dependencies
$ sudo apt-get install python3-numpy python3-matplotlib python3-scipy
-----JAVA SETUP-----
#Check if JAVA is installed; Spark runs on the JVM and requires it
$ java -version
#If not installed, type the command below
#This will install JAVA-11
$ sudo apt install default-jdk
#Once JAVA is installed, add it to the environment path
$ sudo update-alternatives --config java
#Copy the path printed by the command above and add it to the environment
#file as "JAVA_HOME=/path/to/JAVA"
$ sudo nano /etc/environment
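
For reference, the line added to /etc/environment might look like the one below; update-alternatives prints the full path to the java binary, so strip the trailing /bin/java part to get JAVA_HOME (the JDK 11 path shown is just a typical Ubuntu location, yours may differ). Reload the file and echo the variable to confirm it is set:

#Example line in /etc/environment (the exact path will vary on your system)
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
#Reload the environment file and verify
$ source /etc/environment
$ echo $JAVA_HOME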

We need to set up an SSH key using ssh-keygen so that Spark can access the other machines without being prompted for a password.

#Install ssh
$ sudo apt-get install openssh-server openssh-client
#Edit the sshd_config file to allow certain permissions
$ sudo nano /etc/ssh/sshd_config
PermitRootLogin yes
PubkeyAuthentication yes
PasswordAuthentication yes
#Restart the SSH service so the changes in the file take effect
$ sudo systemctl restart ssh
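
To confirm that the daemon has picked up the new settings, you can check that the service is active (on Ubuntu the service is named ssh):

#Verify that the SSH service is running
$ sudo systemctl status ssh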

All the steps up to this point need to be done on all the nodes. Now we need to generate a public key and copy it to the other nodes for password-less authentication.

Perform the steps below only on the node on which you are going to start the Spark master.

# Generate key with rsa algorithm
#This will create a .ssh directory at /home/usr
$ ssh-keygen -t rsa
#Skip the passphrase and other prompts by pressing "ENTER"
#Copy the id_rsa.pub public key using scp to the root directories of all
#other nodes. The first time, this will ask for the passwords of
#root@worker_1_IP and root@worker_2_IP
$ scp .ssh/id_rsa.pub root@worker_1_IP:
$ scp .ssh/id_rsa.pub root@worker_2_IP:
#Change to the root directory by typing "cd ~"
#Switch to root@worker_1_IP and root@worker_2_IP and copy id_rsa.pub
#from the root directory to the user's home directory
$ ssh root@worker_1_IP (and for worker_2 also)
$ cp id_rsa.pub /home/usr/.ssh/authorized_keys
#The step above will ask whether to overwrite the key if one is already present
#Once done with the steps above, check that password-less authorisation
#works from your Master node
$ ssh usr@worker_1_IP
$ ssh usr@worker_2_IP
NOTE: I would advise you to keep the user name the same on all machines (VM included) if security is not an issue, because Spark requires the Master node user (in my case "rahul") to be able to log in as usr@worker_1_IP and usr@worker_2_IP, where usr has to be the same user (in my case "rahul"). Another alternative is to create users with sudo permissions on the slave machines.
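
As an aside, if ssh-copy-id is available on the master node, it appends the public key to the remote user’s authorized_keys in one step and achieves the same result as the manual scp/cp sequence above:

#Alternative: copy the public key directly to each worker's user account
$ ssh-copy-id usr@worker_1_IP
$ ssh-copy-id usr@worker_2_IP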

Below, we install Apache Spark and edit its configuration files to set up the cluster. We have already installed all the dependencies and done the network configuration needed to run Spark error-free. Perform these steps on all nodes.

#Download Apache Spark-3.0.1 with hadoop2.7 tar file
$ wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
#Check if tar file is present
$ ls -la
#Extract the tar file
$ tar xvf spark-*
#Move the extracted directory to /opt/spark
$ sudo mv spark-3.0.1-bin-hadoop2.7 /opt/spark
#/opt/spark is going to be your SPARK_HOME directory. Also add the
#python3 path as PYSPARK_PYTHON. These changes are to be made in the
#.profile file
$ sudo nano .profile
#Add following variables and path
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
#Save and close the .profile file, then reload it
$ source .profile
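
Before touching the cluster configuration, it is worth confirming that the new paths were picked up; the commands below simply print the variables and the installed Spark version:

#Confirm the environment variables and the Spark installation
$ echo $SPARK_HOME
$ echo $PYSPARK_PYTHON
$ spark-submit --version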

Now we edit the configuration files on the Master node only. Do not perform these steps on the slave nodes.

#Copy the spark-env.sh.template and slaves.template files before editing
$ cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
$ cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
#Open the spark-env.sh file and add JAVA_HOME and other configuration
#such as memory or the number of CPU cores
$ sudo nano $SPARK_HOME/conf/spark-env.sh
#add at the end of the file
export JAVA_HOME=/path/to/JAVA   #same path as JAVA_HOME in /etc/environment
export SPARK_MASTER_HOST=master_IP
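#(Optional) Example worker resource settings, assuming each worker can
#spare 4 CPU cores and 8 GB of RAM; adjust these values to your machines
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g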
#Open the slaves file and add the IP addresses of the slaves,
#deleting "localhost" if present
$ sudo nano $SPARK_HOME/conf/slaves
worker_1_IP
worker_2_IP
#Save the file above and run the command below to check whether all
#nodes start
$ $SPARK_HOME/sbin/start-all.sh
#Go to a web browser on the Master node and type
http://127.0.0.1:8080/
#This will show the number of workers, their running status, their configuration and other information.
#To stop the Master and workers
$ $SPARK_HOME/sbin/stop-all.sh
#To start the Spark shell, which uses the Scala programming language
#Quit it by typing ":q" and pressing ENTER
$ $SPARK_HOME/bin/spark-shell
#To start the Spark shell with the Python programming language
#Quit it by typing "exit()" and pressing ENTER
$ $SPARK_HOME/bin/pyspark

Submitting an Application to the Cluster

We can check whether our cluster is functioning by submitting an application. A Spark application creates a SparkContext instance, which holds a SparkConf object specifying whether the application should run in local processes or on the cluster. We will cover this in future posts, but for now just run the command below to calculate the value of Pi.

#IP=MASTER node IP address
#PORT=7077 for me
#You can check your master address by opening http://127.0.0.1:8080/
$ MASTER=spark://IP:PORT $SPARK_HOME/bin/run-example org.apache.spark.examples.SparkPi
#You should get a value for Pi along with a bunch of warnings
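
The same idea applies to your own applications: spark-submit takes a --master URL that points the job at the standalone cluster. A minimal sketch is shown below, where my_app.py is just a placeholder for whatever script you want to run:

#Submit your own PySpark script to the cluster (my_app.py is a placeholder)
$ $SPARK_HOME/bin/spark-submit --master spark://IP:PORT my_app.py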

You should see a screen similar to the one shown below in the Spark UI.

SparkUI: Job completion (image by author)

Conclusion

In this post, we have learnt how to set up your own Apache Spark cluster environment, how to submit a simple application to it, and how to inspect the state of the nodes using the Spark UI. In the next posts I’ll explain how to configure your applications to execute in the cluster environment and save their results in the form of RDDs (Resilient Distributed Datasets).
