Hadoop Cluster setup on Ubuntu 16.04
It's difficult to find proper documentation on how to set up a Hadoop and Spark cluster. Here I am going to lay out precise and clear steps for setting up the clusters.
Hadoop Clusters
Environment:
- NameNode (10.0.0.190)
- DataNode1 (10.0.0.191) 200 GB disk
- DataNode2 (10.0.0.192) 200 GB disk
- DataNode3 (10.0.0.193) 200 GB disk
It is better to provide uniform storage across the DataNodes. Since replication is handled by Hadoop itself, it is not recommended to set up individual RAID partitions on the DataNodes.
Step 1: Configure passwordless ssh
Install Ubuntu 16.04 on all the machines and configure passwordless SSH from the NameNode to every node, and also from each node to the others. This can be done by generating an SSH private/public key pair with the ssh-keygen command and using ssh-copy-id to copy the public key to each machine you want to SSH into without a password.
ssh-keygen -t rsa
ssh-copy-id daya@10.0.0.191
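The key only needs to be generated once per machine; the copy step is then repeated for every host. A minimal loop run from the NameNode (assuming the user daya exists on every node) looks like this:
# copy the NameNode's public key to every node for passwordless SSH
for host in 10.0.0.190 10.0.0.191 10.0.0.192 10.0.0.193; do
  ssh-copy-id daya@$host
done
Repeat the same idea from each DataNode so every node can reach the others without a password.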
Step 2: Configure host-based resolution if DNS is not present
On each node, add the following entries to /etc/hosts so the hostnames resolve.
127.0.0.1  localhost
10.0.0.190 namenode
10.0.0.191 datanode1
10.0.0.192 datanode2
10.0.0.193 datanode3
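A quick way to confirm the entries work is to ping each hostname from the NameNode:
ping -c 1 datanode1
ping -c 1 datanode2
ping -c 1 datanode3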
Step 3: Download the spark-2.0.0, hadoop-2.7.3 and JDK 1.7 binaries from their respective download repositories. Extract them into your home directory or any other directory of your choice.
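As an illustration, the Hadoop and Spark tarballs can be pulled from the Apache archive (the URLs below are assumptions based on the standard archive layout and may need adjusting; the Oracle JDK has to be downloaded manually from Oracle's site). The Spark archive extracts to spark-2.0.0-bin-hadoop2.7, so rename it to match the paths used below:
cd ~
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
tar -xzf hadoop-2.7.3.tar.gz
tar -xzf spark-2.0.0-bin-hadoop2.7.tgz
mv spark-2.0.0-bin-hadoop2.7 spark-2.0.0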
Step 4: Manage the environment variables
To manage the environment variables, add the following settings to your .profile file.
export JAVA_HOME=/home/daya/jdk1.7
export SPARK_HOME=/home/daya/spark-2.0.0
export HADOOP_HOME=/home/daya/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
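After saving .profile, reload it and confirm the tools are on the PATH:
source ~/.profile
hadoop version
java -version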
Step 5: Hadoop configuration files on all nodes
- hadoop-env.sh
- core-site.xml
- yarn-site.xml
- mapred-site.xml
hadoop-env.sh
cd $HADOOP_CONF_DIR
vi hadoop-env.sh
export JAVA_HOME=/home/daya/jdk1.7

core-site.xml
Inside the configuration tag:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:9000</value>
</property>

yarn-site.xml
Similar to core-site.xml, put these settings inside the configuration tag:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>namenode</value>
</property>

mapred-site.xml
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>namenode:54311</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
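Since these four files must be identical on every node, one option (assuming the same directory layout on each machine) is to edit them on the NameNode and copy them out with scp:
# push the shared config files from the NameNode to every DataNode
for host in datanode1 datanode2 datanode3; do
  scp $HADOOP_CONF_DIR/{hadoop-env.sh,core-site.xml,yarn-site.xml,mapred-site.xml} \
    daya@$host:/home/daya/hadoop-2.7.3/etc/hadoop/
done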
Step 6: NameNode-specific configuration
- hdfs-site.xml
- masters
- slaves
hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/daya/hadoop_data/hdfs/namenode</value>
</property>
The replication factor is 3 since there are 3 DataNodes.

slaves
datanode1
datanode2
datanode3

masters
namenode
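It is safest to create the dfs.namenode.name.dir path on the NameNode before formatting:
mkdir -p /home/daya/hadoop_data/hdfs/namenode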
Step 7: DataNode-specific configuration
hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/daya/hadoop_data/hdfs/datanode</value>
</property>
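Likewise, create the dfs.datanode.data.dir path on every DataNode:
mkdir -p /home/daya/hadoop_data/hdfs/datanode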
Step 8: Format the NameNode
hdfs namenode -format
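With the NameNode formatted, start HDFS and YARN from the NameNode (the same commands are used again in the troubleshooting section below) and run jps on each machine to confirm the daemons came up:
start-dfs.sh
start-yarn.sh
jps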
Once the cluster is up and running, its status can be checked via the web interface (the NameNode UI listens on port 50070 and the YARN ResourceManager UI on port 8088).
Spark Cluster on top of the Hadoop Cluster
We can utilize the same Hadoop cluster to run a Spark cluster. A Spark cluster is comparatively easier to set up than its Hadoop counterpart. First we set the Spark environment variables and point the PATH to the Spark binaries; this has already been included in the .profile file mentioned above.
Start the Spark master:
cd /home/daya/spark-2.0.0/sbin
./start-master.sh

Start the Spark slaves:
./start-slave.sh spark://namenode:7077

Here namenode is the Spark master. Run start-slave.sh on all the remaining data nodes.
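A quick way to confirm the daemons are up is jps, which should list a Master process on namenode and a Worker process on each data node:
jps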
Spark cluster status can be checked via the master's web UI on port 8080.
To submit a job to Spark that reads data from the Hadoop cluster:
spark-submit --master spark://namenode:7077 --class TAQExtract TAQExtract.jar hdfs://namenode:9000/finance-dataset/taqnbbo20150824 hdfs://namenode:9000/finance-dataset/taqnbbo20150824_result
Zeppelin
A web-based notebook for interactive data analytics and collaborative documents with SQL, Scala and more.
Download Zeppelin.
cd /home/daya/zeppelin-0.7.2-bin-all/conf
vi zeppelin-env.sh
export SPARK_HOME=/home/daya/spark-2.0.0
export JAVA_HOME=/home/daya/jdk1.7
../bin/zeppelin-daemon.sh start
Since Spark is already using port 8080, you have to change the Zeppelin port to another one; also specify the IP address of the Zeppelin server in zeppelin-site.xml inside the conf directory.
<property>
  <name>zeppelin.server.addr</name>
  <value>10.0.0.190</value>
  <description>Server address</description>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8010</value>
  <description>Server port.</description>
</property>
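Note that the conf directory ships only zeppelin-site.xml.template by default, so copy it to zeppelin-site.xml before editing. After saving the change, restart the Zeppelin daemon so the new address and port take effect, then open the notebook UI at http://10.0.0.190:8010:
/home/daya/zeppelin-0.7.2-bin-all/bin/zeppelin-daemon.sh restart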
Basic Troubleshooting:
NameNode and DataNode not starting:
Sometimes the DataNode does not start and sometimes the NameNode does not start. In that case, remove the directories that were created under /tmp, specify new directories for Hadoop in the configuration files, and then run:
start-dfs.sh && start-yarn.sh
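Once the daemons are back up, hdfs dfsadmin -report should list all three DataNodes as live:
hdfs dfsadmin -report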