Running a Multi-Node Hadoop Cluster

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers. Setting up a Hadoop cluster is not that hard, but I could not find a single cohesive guide for it, so I decided to write one. Let's begin…

Let's assume that we have three Linux servers: one named master and two workers named worker1 and worker2. I may use some CentOS (RHEL) specific commands. We will go through the following steps to get the cluster running:

  1. Prepare Servers
  2. Install JDK
  3. Config HDFS and YARN
  4. Start HDFS and YARN daemons

Prepare Servers

As HDFS and YARN use hostnames to communicate, we should set the hostname on all servers and map those hostnames to the IP addresses 192.168.0.1, 192.168.0.2, and 192.168.0.3. SSH into each server and add these lines to the /etc/hosts file.


[root@192.168.0.x ~]# vi /etc/hosts


# Add these lines:
192.168.0.1 master
192.168.0.2 worker1
192.168.0.3 worker2

I will use the notation [{user}@{server} {working directory}] to specify the user, server, and working directory on which the commands should be run, so check them before running scripts and commands. Now, it is time to set the hostname. SSH into each server, set the corresponding hostname using the following commands, and reboot it:


[root@192.168.0.x ~]# hostnamectl set-hostname {name} # name: master, worker1, worker2 
[root@192.168.0.x ~]# hostnamectl # Should print {name} 
[root@192.168.0.x ~]# reboot 

On each server, create a user named hduser in the hadoop group, make it a sudoer, and set a password for it, using these commands:


[root@{hostname} ~]# groupadd hadoop 
[root@{hostname} ~]# useradd -m hduser 
[root@{hostname} ~]# usermod -aG hadoop hduser 
[root@{hostname} ~]# usermod -aG wheel hduser 
[root@{hostname} ~]# passwd hduser 

The hduser on the master should be able to connect to all servers, including the master itself, with passwordless ssh. We should create an ssh key for hduser on the master server and copy it to every server, including the master:


[hduser@master ~]# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa 
[hduser@master ~]# ssh-copy-id -i .ssh/id_rsa.pub hduser@{master and workers}

After running the commands above, you can verify that hduser can ssh into all servers without being prompted for a password, as shown below. After these simple steps, the servers are ready, and we can go to the next step.
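
For example (a minimal check, assuming the host entries and keys set up above), each of these commands should print the remote hostname without asking for a password:


[hduser@master ~]# ssh master hostname    # should print: master
[hduser@master ~]# ssh worker1 hostname   # should print: worker1
[hduser@master ~]# ssh worker2 hostname   # should print: worker2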

Install Java

Before using Apache Hadoop, we should install Java, so download the JDK or JRE (I prefer the JDK) on the master and copy it to the workers.


[hduser@master ~]# wget {JDK LINK (I downloaded jdk-8u291-linux-x64.tar.gz)}
[hduser@master ~]# scp jdk-8u291-linux-x64.tar.gz hduser@worker1:/home/hduser/jdk-8u291-linux-x64.tar.gz
[hduser@master ~]# scp jdk-8u291-linux-x64.tar.gz hduser@worker2:/home/hduser/jdk-8u291-linux-x64.tar.gz

On all servers, extract the JDK and create a symbolic link to it, so that switching Java versions is simple in the future. I will extract everything into the /opt directory.


[hduser@{server} ~]# cd /opt 
[hduser@{server} /opt]# sudo tar xzf /home/hduser/jdk-8u291-linux-x64.tar.gz 
[hduser@{server} /opt]# sudo ln -s jdk1.8.0_291/ jdk 
[hduser@{server} /opt]# sudo chown -R hduser:hadoop jdk 
[hduser@{server} /opt]# sudo chown -R hduser:hadoop jdk1.8.0_291 
[hduser@{server} /opt]# sudo update-alternatives --install /usr/bin/java java /opt/jdk/bin/java 100 
[hduser@{server} /opt]# sudo update-alternatives --install /usr/bin/javac javac /opt/jdk/bin/javac 100 
[hduser@{server} /opt]# # Check that Java is installed successfully
[hduser@{server} /opt]# update-alternatives --display java 
[hduser@{server} /opt]# update-alternatives --display javac 
[hduser@{server} /opt]# java -version

Config HDFS and YARN

At this point, our servers are set up and Java is installed! Let's download Apache Hadoop, copy it to all servers, extract it, and create the links, just as we did for the JDK.


[hduser@master ~]# wget {Apache Hadoop LINK (I downloaded hadoop-3.3.0-aarch64.tar.gz)}
[hduser@master ~]# scp hadoop-3.3.0-aarch64.tar.gz hduser@worker1:/home/hduser/hadoop-3.3.0-aarch64.tar.gz
[hduser@master ~]# scp hadoop-3.3.0-aarch64.tar.gz hduser@worker2:/home/hduser/hadoop-3.3.0-aarch64.tar.gz
[hduser@{server} ~]# cd /opt
[hduser@{server} /opt]# sudo tar xzf /home/hduser/hadoop-3.3.0-aarch64.tar.gz
[hduser@{server} /opt]# sudo ln -s hadoop-3.3.0/ hadoop
[hduser@{server} /opt]# sudo chown -R hduser:hadoop hadoop
[hduser@{server} /opt]# sudo chown -R hduser:hadoop hadoop-3.3.0

On all servers, create a directory where Hadoop will store all of its data. I will create this directory under the /data directory, so if you've chosen another location, don't forget to change the scripts accordingly. We will create four subdirectories:

  • data: stores blocks for the DataNode
  • name: stores metadata for the NameNode
  • pid: stores the daemons' PID files
  • tmp: holds Apache Hadoop temporary files

[hduser@{server} ~]# cd /data
[hduser@{server} /data]# sudo mkdir hadoop
[hduser@{server} /data]# sudo chown -R hduser:hadoop hadoop
[hduser@{server} /data]# cd hadoop
[hduser@{server} /data/hadoop]# mkdir data
[hduser@{server} /data/hadoop]# mkdir name
[hduser@{server} /data/hadoop]# mkdir pid
[hduser@{server} /data/hadoop]# mkdir tmp
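
You can optionally double-check that the layout and ownership came out as intended (a quick sanity check, nothing more):


[hduser@{server} /data/hadoop]# ls -ld /data/hadoop /data/hadoop/*
# /data/hadoop should be owned by hduser:hadoop, and the four subdirectories should exist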

All Hadoop config files are available under the /opt/hadoop/etc/hadoop directory, and most of the configs should be the same on all servers, so we will modify them one by one on the master, then copy them to the workers and make a few small modifications there.


[hduser@master ~]# cd /opt/hadoop/etc/hadoop
[hduser@master /opt/hadoop/etc/hadoop]# vi core-site.xml
# Add these elements:
## Do not forget to change the tmp directory
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://{master}:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
[hduser@master /opt/hadoop/etc/hadoop]# vi hadoop-env.sh
# Add these commands:
## Do not forget to change the PID directory
export JAVA_HOME=/opt/jdk
export HADOOP_PID_DIR=/data/hadoop/pid

[hduser@master /opt/hadoop/etc/hadoop]# vi hdfs-site.xml
# Add these elements:
## The dfs.replication property specifies the replication factor; you may want to change it. I prefer a replication factor of two for small clusters.
## Do not forget to change the data directory
## Do not forget to change the name directory
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
[hduser@master /opt/hadoop/etc/hadoop]# vi mapred-site.xml
# Add these elements:
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at.  If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
[hduser@master /opt/hadoop/etc/hadoop]# vi workers

# Add the hostnames of all nodes that should run worker daemons (I use the master as a worker too)
master
worker1
worker2
[hduser@master /opt/hadoop/etc/hadoop]# vi yarn-site.xml

# Add these lines
## You may want to change the max-disk-utilization-per-disk-percentage.
## Notice that the value of the "yarn.nodemanager.hostname" property is {worker}. We leave it as a placeholder for now and will change it on each server later.
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>95</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>{worker}</value>
  </property>

OK! Done! Now, we will copy all the config files to the workers and then adjust a few of them per server.


[hduser@master /opt/hadoop/etc/hadoop]# scp core-site.xml hadoop-env.sh hdfs-site.xml mapred-site.xml yarn-site.xml hduser@{workers}:/opt/hadoop/etc/hadoop/
[hduser@{server} /opt/hadoop/etc/hadoop]# vi yarn-site.xml
# Change the value of "yarn.nodemanager.hostname" to the name of the server.
## On master: <value>master</value>
## On worker1: <value>worker1</value>
## On worker2: <value>worker2</value>

It is time to set the environment variables on all servers. I recommend editing the file on the master and then copying it to the others (see the sketch after the listing below).


[hduser@{server} ~]# vi .bashrc
# Add these commands
# Java
export JAVA_HOME=/opt/jdk
# Hadoop & YARN
export PDSH_RCMD_TYPE=ssh
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=/opt/hadoop
export HADOOP_HDFS_HOME=/opt/hadoop
export HADOOP_YARN_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
export HDFS_NAMENODE_USER="hduser"
export HDFS_DATANODE_USER="hduser"
export HDFS_SECONDARYNAMENODE_USER="hduser"
export YARN_RESOURCEMANAGER_USER="hduser"
export YARN_NODEMANAGER_USER="hduser"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
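
A minimal sketch of distributing the file, assuming the same .bashrc works on every server and that passwordless ssh is already in place:


[hduser@master ~]# scp ~/.bashrc hduser@worker1:/home/hduser/.bashrc
[hduser@master ~]# scp ~/.bashrc hduser@worker2:/home/hduser/.bashrc
[hduser@master ~]# source ~/.bashrc  # reload on the master; log in again (or source) on each worker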

Start HDFS and YARN daemons

We have downloaded the files, distributed them, and configured all servers! Now, it is time to start the cluster. We should format the NameNode and then start HDFS and YARN. When starting, start HDFS first and then YARN; when stopping, stop YARN first and then HDFS.


[hduser@master ~]# hdfs namenode -format
[hduser@master ~]# /opt/hadoop/sbin/start-dfs.sh
[hduser@master ~]# /opt/hadoop/sbin/start-yarn.sh
[hduser@master ~]# /opt/hadoop/sbin/stop-yarn.sh
[hduser@master ~]# /opt/hadoop/sbin/stop-dfs.sh
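
Once the daemons are up, a quick sanity check is worth the effort. This is a minimal sketch, assuming the default Hadoop 3 web UI ports (9870 for the NameNode, 8088 for the ResourceManager) and the environment variables set earlier:


[hduser@master ~]# jps                    # should list NameNode, SecondaryNameNode, ResourceManager (plus DataNode and NodeManager, since master is also listed as a worker)
[hduser@worker1 ~]# jps                   # should list DataNode and NodeManager
[hduser@master ~]# hdfs dfsadmin -report  # should report three live DataNodes
# Web UIs (from a machine that can resolve the hostnames):
#   NameNode:        http://master:9870
#   ResourceManager: http://master:8088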

You can also create systemd services on the master to control HDFS and YARN.


[hduser@master ~]# cd /etc/systemd/system
[hduser@master /etc/systemd/system]# sudo vi hdfs.service
# Add these lines
## Do not forget to change the PID directory
[Unit]
Description=Hadoop DFS namenode and datanode
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target

[Service]
User=hduser
Group=hadoop
Type=simple
ExecStart=/opt/hadoop/sbin/start-dfs.sh
ExecStop=/opt/hadoop/sbin/stop-dfs.sh
WorkingDirectory=/home/hduser
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/data/hadoop/pid/hadoop-hduser-namenode.pid

[Install]
WantedBy=multi-user.target

[hduser@master /etc/systemd/system]# sudo vi yarn.service

# Add these lines
## Do not forget to change the PID directory
[Unit]
Description=YARN resourcemanager and nodemanagers
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target

[Service]
User=hduser
Group=hadoop
Type=simple
ExecStart=/opt/hadoop/sbin/start-yarn.sh
ExecStop=/opt/hadoop/sbin/stop-yarn.sh
WorkingDirectory=/home/hduser
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/data/hadoop/pid/hadoop-hduser-resourcemanager.pid

[Install]
WantedBy=multi-user.target
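
After creating the unit files, reload systemd and enable the services. A minimal sketch, assuming the two unit files above were saved as hdfs.service and yarn.service on the master:


[hduser@master /etc/systemd/system]# sudo systemctl daemon-reload
[hduser@master /etc/systemd/system]# sudo systemctl enable --now hdfs yarn
[hduser@master /etc/systemd/system]# systemctl status hdfs yarn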

I hope this tutorial was useful. Thank You.

 


Author: Saeid Dadkhah
