Hadoop and Spark Cluster Setup

Hadoop Cluster setup on Ubuntu 16.04

It's difficult to find proper documentation on how to set up a Hadoop and Spark cluster. Here I lay out precise, clear steps for setting up both clusters.

Hadoop Clusters

Environment:

  1. NameNode  (10.0.0.190)
  2. DataNode1  (10.0.0.191)  200 GB disk
  3. DataNode2   (10.0.0.192)  200 GB disk
  4. DataNode3  (10.0.0.193)   200 GB disk

It's better to provide uniform storage across the DataNodes. Since Hadoop itself replicates data across the nodes, setting up a separate RAID partition on each DataNode is not recommended.

Step 1: Configure passwordless ssh

Install Ubuntu 16.04 on all the machines and configure passwordless ssh from the NameNode to every other node. Also configure passwordless ssh from each node to the others. This is done by generating a private and public key pair with the ssh-keygen command and then using the ssh-copy-id command to copy the public key to each machine you want to reach without a password.

ssh-keygen -t rsa
ssh-copy-id daya@10.0.0.191
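Repeating ssh-copy-id by hand for every node gets tedious, so here is a minimal sketch of a loop over the four addresses from the Environment list. The DRY_RUN switch and the function name are my own additions for illustration; with the default DRY_RUN=1 it only prints the commands so you can review them first.

```shell
# Addresses come from the Environment list above; "daya" is the user from the example.
NODES="10.0.0.190 10.0.0.191 10.0.0.192 10.0.0.193"
SSH_USER="daya"

# Print (DRY_RUN=1, the default) or run ssh-copy-id for every node.
copy_key_to_nodes() {
  for node in $NODES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh-copy-id ${SSH_USER}@${node}"
    else
      ssh-copy-id "${SSH_USER}@${node}"
    fi
  done
}

copy_key_to_nodes
```

Run it with DRY_RUN=0 on the NameNode after ssh-keygen, then repeat on each DataNode if you also want node-to-node access.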

Step 2: Configure host based resolution if DNS is not present

On each node, add the following entries to /etc/hosts so that the hostnames resolve.

127.0.0.1 localhost
10.0.0.190  namenode
10.0.0.191  datanode1
10.0.0.192  datanode2
10.0.0.193  datanode3
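If you script this step, it helps to make it safe to run more than once. The sketch below appends an entry only when the hostname is not already mapped. The function name and the HOSTS_FILE variable are assumptions of mine; by default it writes to a scratch file so you can try it without root.

```shell
# add_host_entry FILE IP NAME: append "IP  NAME" unless NAME is already mapped.
add_host_entry() {
  file=$1; ip=$2; name=$3
  if ! grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$file" 2>/dev/null; then
    printf '%s  %s\n' "$ip" "$name" >> "$file"
  fi
}

# Assumption: a scratch file is the default target; set HOSTS_FILE=/etc/hosts
# and run as root to apply the entries for real.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}
add_host_entry "$HOSTS_FILE" 10.0.0.190 namenode
add_host_entry "$HOSTS_FILE" 10.0.0.191 datanode1
add_host_entry "$HOSTS_FILE" 10.0.0.192 datanode2
add_host_entry "$HOSTS_FILE" 10.0.0.193 datanode3
cat "$HOSTS_FILE"
```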

Step 3: Download the spark-2.0.0, hadoop-2.7.3 and JDK 1.7 binary distributions from their respective download sites. Extract them into your home directory or any other directory of your choice.

Step 3: Set the environment variables

To manage the environment variables, add these settings to the .profile file.

export JAVA_HOME=/home/daya/jdk1.7
export SPARK_HOME=/home/daya/spark-2.0.0
export HADOOP_HOME=/home/daya/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
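After logging in again (or sourcing .profile), a quick check that the variables point at real directories can save time later. This is only a sketch; check_home is a helper name I made up for illustration.

```shell
# check_home NAME DIR: report whether DIR exists; returns non-zero if it does not.
check_home() {
  if [ -d "$2" ]; then
    echo "ok: $1=$2"
  else
    echo "missing: $1=$2"
    return 1
  fi
}

status=0
check_home JAVA_HOME "${JAVA_HOME:-unset}" || status=1
check_home SPARK_HOME "${SPARK_HOME:-unset}" || status=1
check_home HADOOP_HOME "${HADOOP_HOME:-unset}" || status=1
check_home HADOOP_CONF_DIR "${HADOOP_CONF_DIR:-unset}" || status=1
echo "overall status: $status"
```

If everything reports ok, java -version and hadoop version should also work from any directory.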

Step 4: Hadoop configuration files in all Nodes

  1. hadoop-env.sh
  2. core-site.xml
  3. yarn-site.xml
  4. mapred-site.xml
hadoop-env.sh
cd $HADOOP_CONF_DIR
vi hadoop-env.sh
export JAVA_HOME=/home/daya/jdk1.7

core-site.xml
Add this inside the <configuration> tag.
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://namenode:9000</value>
</property>

yarn-site.xml
As with core-site.xml, put these settings inside the <configuration> tag.
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 </property>
 <property>
 <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
 <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
 <property>
 <name>yarn.resourcemanager.hostname</name>
 <value>namenode</value>
 </property>

mapred-site.xml
<property>
 <name>mapreduce.jobtracker.address</name>
 <value>namenode:54311</value>
 </property>
 <property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
 </property>
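Since the four files from Step 4 are identical on every node, one way to keep them in sync is to edit them on the NameNode and copy them out. Below is a rough sketch, assuming the same directory layout and user on every machine. DRY_RUN and push_shared_conf are names I introduced; by default the commands are only printed.

```shell
DATANODES="datanode1 datanode2 datanode3"
SSH_USER="daya"
# Falls back to the path used in this post if HADOOP_CONF_DIR is not set.
CONF_DIR="${HADOOP_CONF_DIR:-/home/daya/hadoop-2.7.3/etc/hadoop}"
SHARED_FILES="hadoop-env.sh core-site.xml yarn-site.xml mapred-site.xml"

# Print (DRY_RUN=1, the default) or run one scp per file and node.
# Assumes the paths contain no spaces.
push_shared_conf() {
  for node in $DATANODES; do
    for f in $SHARED_FILES; do
      cmd="scp $CONF_DIR/$f $SSH_USER@$node:$CONF_DIR/"
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
      else
        $cmd
      fi
    done
  done
}

push_shared_conf
```

hdfs-site.xml is deliberately left out because it differs between the NameNode and the DataNodes.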

Step 5: NameNode specific configuration

  1. hdfs-site.xml
  2. masters
  3. slaves
hdfs-site.xml
 <property>
 <name>dfs.replication</name>
 <value>3</value>
 </property>
 <property>
 <name>dfs.namenode.name.dir</name>
 <value>file:///home/daya/hadoop_data/hdfs/namenode</value>
 </property>
The replication factor is 3 because there are three DataNodes.
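It is worth keeping in mind what this means for capacity: every block is stored three times, so the usable space is roughly the raw space divided by the replication factor. A back-of-the-envelope sketch follows (usable_gb is a made-up helper, and it ignores space taken by the OS and other non-HDFS data):

```shell
# usable_gb NODES DISK_GB REPLICATION: raw capacity divided by the replication factor.
usable_gb() {
  echo $(( $1 * $2 / $3 ))
}

# Three DataNodes with 200 GB each and dfs.replication=3
usable_gb 3 200 3   # prints 200
```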

slaves
datanode1
datanode2
datanode3

masters
namenode

Step 6: DataNode Specific configuration

hdfs-site.xml
 <property>
 <name>dfs.replication</name>
 <value>3</value>
 </property>
 <property>
 <name>dfs.datanode.data.dir</name>
 <value>file:///home/daya/hadoop_data/hdfs/datanode</value>
 </property>
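I find it less error-prone to create the storage directories up front, with the right owner, instead of relying on Hadoop to do it. The sketch below turns the file:// value from hdfs-site.xml into a plain path and creates it. Both function names are my own.

```shell
# uri_to_path URI: strip the leading file:// from a local-directory URI.
uri_to_path() {
  printf '%s\n' "${1#file://}"
}

# ensure_dir URI: create the directory behind the URI (and its parents).
ensure_dir() {
  mkdir -p "$(uri_to_path "$1")"
}

# Show the path behind the DataNode setting used above.
uri_to_path "file:///home/daya/hadoop_data/hdfs/datanode"
# On a DataNode: ensure_dir "file:///home/daya/hadoop_data/hdfs/datanode"
# On the NameNode: ensure_dir "file:///home/daya/hadoop_data/hdfs/namenode"
```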

Step 7: Format the NameNode

hdfs namenode -format

 

Once HDFS and YARN have been started with start-dfs.sh && start-yarn.sh, the cluster status can be checked via the web interface:

http://10.0.0.190:50070/
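Besides the web interface, the jps tool that ships with the JDK lists the running Java processes, which makes for a quick command-line check. The helper below is a sketch of mine: it reads jps output and prints the names of expected daemons that are not running.

```shell
# missing_daemons NAME...: read `jps` output on stdin and print every expected
# daemon name that does not appear in it.
missing_daemons() {
  jps_out=$(cat)
  for d in "$@"; do
    printf '%s\n' "$jps_out" | awk '{print $2}' | grep -qx "$d" || echo "$d"
  done
}

# On the NameNode (runs only if jps is on the PATH):
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons NameNode ResourceManager
fi
```

On a DataNode the equivalent call would be jps | missing_daemons DataNode NodeManager. No output means everything expected is up.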

 

Spark Cluster on top of the Hadoop Cluster

We can use the same Hadoop cluster to run Spark. A Spark cluster is comparatively easier to set up than its Hadoop counterpart. First, set the Spark environment variables and put Spark's binaries on the PATH. The .profile file shown above already sets SPARK_HOME and adds Spark's bin directory to PATH; the sbin scripts are run from their own directory below.

Start the spark master
cd /home/daya/spark-2.0.0/sbin 
./start-master.sh

Start the spark slave
./start-slave.sh spark://namenode:7077

Here namenode is the Spark master. Run start-slave.sh on each of the data nodes.

 

The Spark cluster status can be checked on port 8080 of the master (here http://10.0.0.190:8080/).

To submit a job to Spark that reads its data from the Hadoop cluster:

spark-submit --master spark://namenode:7077 --class TAQExtract TAQExtract.jar hdfs://namenode:9000/finance-dataset/taqnbbo20150824 hdfs://namenode:9000/finance-dataset/taqnbbo20150824_result

Zeppelin

Zeppelin is a web-based notebook for interactive data analytics and collaborative documents with SQL, Scala and more.

Download Zeppelin.

cd /home/daya/zeppelin-0.7.2-bin-all/conf
vi zeppelin-env.sh
# add these two lines to zeppelin-env.sh:
export SPARK_HOME=/home/daya/spark-2.0.0
export JAVA_HOME=/home/daya/jdk1.7
# zeppelin-daemon.sh lives in the bin directory, not in conf:
../bin/zeppelin-daemon.sh start

Since the Spark master web UI already uses port 8080, you have to move Zeppelin to another port. Also specify the IP address of the Zeppelin server. Both are set in zeppelin-site.xml inside the conf directory.

<property>
 <name>zeppelin.server.addr</name>
 <value>10.0.0.190</value>
 <description>Server address</description>
</property>
<property>
 <name>zeppelin.server.port</name>
 <value>8010</value>
 <description>Server port.</description>
</property>
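Before settling on a port it is worth checking that nothing else is listening on it. Here is a small sketch that parses the output of ss -ltn (part of iproute2 on Ubuntu); port_listed is a helper name I made up, and it assumes the usual column layout where the fourth column is the local address and port.

```shell
# port_listed PORT: read `ss -ltn` output on stdin; succeed if some listening
# socket uses PORT as its local port.
port_listed() {
  awk -v p="$1" 'NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
                 END { exit !found }'
}

# Check the Spark UI port and the Zeppelin port chosen above (needs ss).
if command -v ss >/dev/null 2>&1; then
  for port in 8080 8010; do
    if ss -ltn | port_listed "$port"; then
      echo "port $port is in use"
    else
      echo "port $port is free"
    fi
  done
fi
```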

Basic Troubleshooting:

NameNode or DataNode does not start:

Sometimes a DataNode or the NameNode fails to start. In that case, remove the directories that Hadoop created under /tmp, specify a new Hadoop directory in the configuration file, and then run:

start-dfs.sh  && start-yarn.sh
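For reference, Hadoop's default hadoop.tmp.dir is /tmp/hadoop-${user.name}, so that is usually the directory in question. The sketch below only prints the recovery steps so they can be reviewed before running them on the affected node. Both function names are mine, and I am assuming the storage directories configured earlier live outside /tmp, as they do in this setup; otherwise deleting the directory discards HDFS data.

```shell
# hadoop_tmp_dir [USER]: default Hadoop temp directory for USER (current user if omitted).
hadoop_tmp_dir() {
  printf '/tmp/hadoop-%s\n' "${1:-$(id -un)}"
}

# reset_steps [USER]: print the recovery commands instead of running them.
reset_steps() {
  dir=$(hadoop_tmp_dir "${1:-}")
  echo "stop-yarn.sh"
  echo "stop-dfs.sh"
  echo "rm -rf $dir"
  echo "start-dfs.sh && start-yarn.sh"
}

reset_steps daya
```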

 
