Installing Hadoop in fully distributed mode

Pre-requisite

  1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the JDK's executables are on the PATH, i.e., that PATH includes $JAVA_HOME/bin.
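As a quick sanity check, both conditions can be verified from a shell. The JDK path below is the article's example location and will differ on your machine:

```shell
# Sanity check: JAVA_HOME points at the JDK and its bin directory is on PATH.
# The JDK path is the article's example; adjust it for your machine.
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
PATH="$JAVA_HOME/bin:$PATH"
case ":$PATH:" in
  *":$JAVA_HOME/bin:"*) echo "JAVA_HOME/bin is on PATH" ;;
  *)                    echo "PATH is missing $JAVA_HOME/bin" ;;
esac
```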

  2. SSH configured

Make sure that the machines in the Hadoop cluster can SSH to each other without a password. In a multi-node setup, every machine should be able to do password-less SSH to and from all other machines in the cluster.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh-copy-id -i ~/.ssh/id_rsa.pub impadmin@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub impadmin@slave
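The key-installation steps can be rehearsed locally before touching the cluster. This sketch generates a throwaway key pair in a scratch directory and installs it into an authorized_keys file the same way (the impadmin/master/slave names above are cluster-specific; on a real cluster, ssh-copy-id performs the append and chmod on the remote machine for you):

```shell
# Sketch: rehearse the key setup in a scratch directory instead of ~/.ssh,
# so nothing on the local machine is modified.
scratch=$(mktemp -d)
ssh-keygen -t rsa -N "" -f "$scratch/id_rsa" -q       # empty passphrase, quiet
cat "$scratch/id_rsa.pub" >> "$scratch/authorized_keys"
chmod 0600 "$scratch/authorized_keys"
echo "installed $(wc -l < "$scratch/authorized_keys") key(s)"
```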

Installing And Configuring Hadoop

Assumptions –

For clarity and ease of expression, I'll assume we are setting up a cluster of 2 nodes with the following IP addresses:

 10.10.10.1 – Namenode
 10.10.10.2 – Datanode
  1. Download hadoop-2.6.4 and extract the installation tar on all the nodes at the same path. Create a dedicated user for Hadoop (we assume the dedicated user is “impadmin”).

Make sure that the master and all the slaves have the same user.
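Before copying the tree to every node, it can help to confirm the extracted tarball has the expected layout. The helper below is hypothetical, demonstrated against a mock directory tree since the real path depends on where you extracted Hadoop:

```shell
# Hypothetical helper: verify an extracted Hadoop tree has the directories
# the rest of this guide relies on (bin/, sbin/, etc/hadoop/).
check_layout() {
  for d in bin sbin etc/hadoop; do
    [ -d "$1/$d" ] || { echo "missing: $d"; return 1; }
  done
  echo "layout OK"
}

# Demo against a mock tree; on a real node you would pass
# /home/impadmin/hadoop-2.6.4 instead.
mock=$(mktemp -d)
mkdir -p "$mock/bin" "$mock/sbin" "$mock/etc/hadoop"
check_layout "$mock"    # → layout OK
```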

  2. Set up environment variables

Export the environment variables below on all nodes in the cluster.

export  JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export  HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export  PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export  HADOOP_COMMON_HOME=$HADOOP_PREFIX
export  HADOOP_HDFS_HOME=$HADOOP_PREFIX
export  YARN_HOME=$HADOOP_PREFIX
export  HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export  YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

For this demo we have modified “hadoop-env.sh” to export the variables. You can also use ~/.bashrc, /etc/bash.bashrc, or another startup script to export them.

Add the following lines at the start of the script etc/hadoop/yarn-env.sh:

export  JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export  HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export  PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
export  HADOOP_COMMON_HOME=$HADOOP_PREFIX
export  HADOOP_HDFS_HOME=$HADOOP_PREFIX
export  YARN_HOME=$HADOOP_PREFIX
export  HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export  YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

  3. Create a folder for hadoop.tmp.dir

Create a temp folder under HADOOP_PREFIX:

mkdir -p $HADOOP_PREFIX/tmp

  4. Tweak config files

On all the machines in the cluster, go to the etc/hadoop folder under HADOOP_PREFIX and add the following properties inside the <configuration> tag of the files mentioned below.

etc/hadoop/core-site.xml –

<property>
<name>fs.default.name</name>
<value>hdfs://Master-Hostname:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/impadmin/hadoop-2.6.4/tmp</value>
</property>

etc/hadoop/hdfs-site.xml :

<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

etc/hadoop/mapred-site.xml :

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

etc/hadoop/yarn-site.xml :

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>Master-Hostname:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>Master-Hostname:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>Master-Hostname:8040</value>
</property>

Note: Make sure to replace “Master-Hostname” with your cluster’s master host name.

  5. Add slaves

Update HADOOP_PREFIX/etc/hadoop/slaves on the master machine to add the slave entries.

Open “slaves” and enter the hostname of each slave, one per line.
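For the two-node cluster assumed above, with a single datanode whose hostname is (say) slave, the file would contain just one line:

```
slave
```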

  6. Format namenode

This is a one-time activity. On the master, execute the following command from HADOOP_PREFIX:

         $ bin/hdfs namenode -format

(The older form, $ bin/hadoop namenode -format, still works but is deprecated.)

Once you have data on HDFS, DO NOT run this command again; doing so will result in loss of content.

  7. Run hadoop daemons

From the master, start the DFS daemons. From HADOOP_PREFIX execute:

$ sbin/start-dfs.sh
$ jps

Processes which should run on the master after starting DFS:

NameNode
SecondaryNameNode
Jps

Check on the slave whether the DFS daemons started:

$ jps

Processes which should run on the slaves:

DataNode
Jps

Start the YARN daemons. From HADOOP_PREFIX execute:

$ sbin/start-yarn.sh
$ jps

Processes which should now run on the master:

NameNode
SecondaryNameNode
ResourceManager
Jps

Check on the slave whether the YARN daemons started:

$ jps

Processes which should now run on the slaves:

DataNode
NodeManager
Jps
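The jps checks above can be scripted. The helper below is hypothetical; it is fed canned jps output for the demo, but on a real node you would pipe `jps` into it:

```shell
# Hypothetical helper: read jps output on stdin and confirm that every
# expected daemon name appears in it.
expect_daemons() {
  out=$(cat)
  for d in $1; do
    echo "$out" | grep -qw "$d" || { echo "missing: $d"; return 1; }
  done
  echo "all daemons running"
}

# Canned master output for the demo; on a real master you would run:
#   jps | expect_daemons "NameNode SecondaryNameNode ResourceManager"
printf '1234 NameNode\n2345 SecondaryNameNode\n3456 ResourceManager\n' \
  | expect_daemons "NameNode SecondaryNameNode ResourceManager"
# → all daemons running
```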
  8. Run sample and validate

Let’s run the wordcount sample to validate the setup. Make an input file/directory.

$ mkdir input
$ cat > input/file
This is a sample file.
This is a sample line.

    Add this directory to HDFS:

    $ bin/hdfs dfs -copyFromLocal input /input

 

Run the example:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /input /output

To check the output, execute the below command:

   $ bin/hdfs dfs -cat /output/*
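Since the input is tiny, the expected counts can be recomputed locally with coreutils and compared against the HDFS output (wordcount tokenizes on whitespace, so punctuation stays attached to the words):

```shell
# Recompute the word counts locally for the sample input above, so the
# numbers in /output can be eyeballed against them.
printf 'This is a sample file.\nThis is a sample line.\n' \
  | tr -s '[:space:]' '\n' | sort | uniq -c
```

Expect a count of 2 for This, is, a, and sample, and 1 each for file. and line.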

 

  9. Web interface

We can browse HDFS and check its health using http://masterHostname:50070 in the browser. We can also check the status of running applications through the ResourceManager web UI, which by default is at http://masterHostname:8088. (Port 9000 is the HDFS RPC port configured in core-site.xml, not a web interface.)

Done !!
