



Hadoop and Spark Cluster Setup

Hadoop Cluster setup on Ubuntu 16.04

It's difficult to find proper documentation on how to set up a Hadoop and Spark cluster. Here are precise, clear steps for setting up the clusters.

Hadoop Clusters


  1. NameNode
  2. DataNode1 (200 GB disk)
  3. DataNode2 (200 GB disk)
  4. DataNode3 (200 GB disk)

It's better to provide uniform storage across the DataNodes. Since replication is handled by Hadoop itself, it is not recommended to set up RAID partitions on the individual DataNodes.

Step 1: Configure passwordless SSH

Install Ubuntu 16.04 on all the machines and configure passwordless SSH from the NameNode to every node, and from each node to every other node. This can be done by generating an SSH key pair with the ssh-keygen command and then using the ssh-copy-id command to copy the public key to each machine you want to SSH into without a password.

ssh-keygen -t rsa
ssh-copy-id daya@
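Assuming the hostnames configured in Step 2 below, the key distribution can be scripted from the NameNode (a sketch; it must be run on the actual cluster and will prompt for each node's password once):

```
# generate the key pair once (press Enter at the prompts)
ssh-keygen -t rsa

# copy the public key to every node in the cluster
for node in namenode datanode1 datanode2 datanode3; do
  ssh-copy-id daya@$node
done
```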

Step 2: Configure host-based resolution if DNS is not present

On each node, add entries to /etc/hosts so the hostnames resolve: 127.0.0.1 localhost, plus one line per node mapping its IP address to namenode, datanode1, datanode2, and datanode3.
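A sketch of the /etc/hosts entries; the 192.168.1.x addresses are placeholders, substitute your nodes' actual IPs:

```
127.0.0.1     localhost
192.168.1.10  namenode
192.168.1.11  datanode1
192.168.1.12  datanode2
192.168.1.13  datanode3
```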

Step 3: Download the Spark 2.0.0, Hadoop 2.7.3, and JDK 1.7 binaries from their respective download repositories. Extract them to your home directory or any other directory of your choice.

Step 4: Manage the environment variables

To manage the environment variables, use the .profile file with these settings.

export JAVA_HOME=/home/daya/jdk1.7
export SPARK_HOME=/home/daya/spark-2.0.0
export HADOOP_HOME=/home/daya/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
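The post later says the PATH entries pointing at the Spark bin and sbin directories have "already been ported in the .profile file", but they are missing above. A likely version, based on that remark (an assumption, repeating the exports for self-containment):

```shell
# same values as the .profile lines above
export JAVA_HOME=/home/daya/jdk1.7
export SPARK_HOME=/home/daya/spark-2.0.0
export HADOOP_HOME=/home/daya/hadoop-2.7.3
# put the Hadoop and Spark command-line tools on the PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
```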

Step 5: Hadoop configuration files on all nodes

  1. hadoop-env.sh
  2. core-site.xml
  3. yarn-site.xml
  4. mapred-site.xml

In hadoop-env.sh, set the Java home:

export JAVA_HOME=/home/daya/jdk1.7

The core-site.xml settings go inside its <configuration> tag.

Similar to core-site.xml, put the yarn-site.xml and mapred-site.xml settings inside their <configuration> tags.
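The original snippets did not survive in this post; a minimal sketch of the usual settings, assuming the hostnames from Step 2 and HDFS on port 9000 (the port the spark-submit example later uses):

```
<!-- core-site.xml: point the default filesystem at the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>

<!-- yarn-site.xml: point the NodeManagers at the ResourceManager -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```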


Step 6: NameNode-specific configuration

  1. hdfs-site.xml
  2. masters
  3. slaves
The replication factor is set to 3, since there are 3 DataNodes.
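The file contents were dropped from the post; a sketch of what they typically hold, assuming HDFS metadata under /home/daya/hdfs (a hypothetical path, pick your own):

```
<!-- hdfs-site.xml on the NameNode -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/daya/hdfs/namenode</value>
  </property>
</configuration>
```

The masters file lists the NameNode host (namenode), and the slaves file lists one DataNode hostname per line (datanode1, datanode2, datanode3).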



Step 7: DataNode-specific configuration
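The DataNode settings were also lost; a sketch of the DataNode-side hdfs-site.xml, again assuming a hypothetical /home/daya/hdfs storage path:

```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/daya/hdfs/datanode</value>
  </property>
</configuration>
```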


Step 8: Format the NameNode and start the daemons

hdfs namenode -format

Then start HDFS and YARN from the NameNode:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh


Once successfully installed, the status can be checked via the NameNode web interface (port 50070 by default).


Spark Cluster on top of the Hadoop Cluster

We can utilize the same Hadoop cluster to run a Spark cluster. A Spark cluster is comparatively easier to set up than its Hadoop counterpart. First set the Spark environment variables and add the Spark bin and sbin directories to the PATH; this has already been done in the .profile file mentioned above.

Start the Spark master:

cd /home/daya/spark-2.0.0/sbin
./start-master.sh

Start the Spark slaves:

./start-slave.sh spark://namenode:7077

Here namenode is the Spark master. Run the slave command on all of the remaining DataNodes.


Spark cluster status can be checked via the web UI on port 8080.

To submit jobs to Spark, reading data from the Hadoop cluster:

spark-submit --master spark://namenode:7077 --class TAQExtract TAQExtract.jar hdfs://namenode:9000/finance-dataset/taqnbbo20150824 hdfs://namenode:9000/finance-dataset/taqnbbo20150824_result


Zeppelin: a web-based notebook for interactive data analytics and collaborative documents with SQL, Scala, and more.

Download Zeppelin.

cd /home/daya/zeppelin-0.7.2-bin-all/conf

In zeppelin-env.sh, set:

export SPARK_HOME=/home/daya/spark-2.0.0
export JAVA_HOME=/home/daya/jdk1.7

Then start Zeppelin from the bin directory:

./zeppelin-daemon.sh start










Since Spark is already using port 8080, you have to change the Zeppelin port to another port; also specify the IP address of the Zeppelin server in zeppelin-site.xml inside the conf directory.

<property>
  <name>zeppelin.server.addr</name>
  <value></value>
  <description>Server address</description>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8010</value>
  <description>Server port.</description>
</property>

Basic Troubleshooting:

NameNode or DataNode not starting:

It may be the case that sometimes the DataNode doesn't start, and sometimes the NameNode doesn't start. In that case, remove the Hadoop directories that were created under /tmp. Also, specify new Hadoop data directories in the configuration files instead of /tmp, then reformat the NameNode and restart the daemons.

