Prerequisites
- Java JDK (This demo uses JDK version 1.7.0_67)
Make sure the JAVA_HOME environment variable points to the JDK, and that the Java executable's directory, $JAVA_HOME/bin, is in the PATH environment variable.
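A quick way to verify both (assuming the JDK path used in this demo):
$ export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
$ export PATH=$JAVA_HOME/bin:$PATH
$ java -version     # should report 1.7.0_67
$ which java        # should resolve to $JAVA_HOME/bin/java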
- SSH configured
Make sure that the machines in the Hadoop cluster can SSH to each other without a password. In a multi-node setup, every machine should be able to do a password-less SSH to and from every other machine in the cluster.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh-copy-id -i ~/.ssh/id_rsa.pub impadmin@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub impadmin@slave
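To confirm that the password-less login actually works (using the "impadmin" user and the "master"/"slave" hostnames from the commands above), a quick check from each machine:
$ ssh impadmin@slave hostname    # should print the slave's hostname with no password prompt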
Installing And Configuring Hadoop
Assumptions –
For clarity and ease of expression, I'll assume we are setting up a cluster of 2 nodes with the following IP addresses:
10.10.10.1 – Namenode
10.10.10.2 – Datanode
- Download hadoop-2.6.4 and extract the installation tar to the same path on all the nodes (see the example commands below).
- Dedicated user for Hadoop (we assume the dedicated user is “impadmin”).
Make sure that master and all the slaves have the same user.
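A minimal sketch of the download-and-extract step, to be repeated on every node. The archive URL is an assumption based on the Apache archive layout; verify it against your preferred mirror:
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
$ tar -xzf hadoop-2.6.4.tar.gz -C /home/impadmin/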
- Set up environment variables
Export environment variables as mentioned below for all nodes in the cluster.
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
For this demo we have modified “hadoop-env.sh” to export the variables. You can also use ~/.bashrc, /etc/bash.bashrc, or another startup script to export them.
Add the following lines at the start of etc/hadoop/yarn-env.sh:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
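To confirm the variables took effect, open a fresh shell (or source the file you edited) and check that the hadoop binary resolves and reports the expected version:
$ $HADOOP_PREFIX/bin/hadoop version    # should print Hadoop 2.6.4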
- Create a folder for hadoop.tmp.dir
Create a temp folder in HADOOP_PREFIX
mkdir -p $HADOOP_PREFIX/tmp
- Tweak config files
For all the machines in the cluster, go to the etc/hadoop folder under HADOOP_PREFIX and add the following properties under the configuration tag in the files mentioned below:
etc/hadoop/core-site.xml :
<property>
  <name>fs.default.name</name>
  <value>hdfs://Master-Hostname:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/impadmin/hadoop-2.6.4/tmp</value>
</property>
etc/hadoop/hdfs-site.xml :
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
etc/hadoop/mapred-site.xml :
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
etc/hadoop/yarn-site.xml :
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>Master-Hostname:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>Master-Hostname:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>Master-Hostname:8040</value>
</property>
Note: Make sure to replace “Master-Hostname” with your cluster’s master host name.
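Instead of editing every node by hand, you can edit the files once on the master and copy them out. A sketch, assuming the "impadmin" user, the "slave" hostname from earlier, and the same installation path on every node:
$ scp $HADOOP_CONF_DIR/*-site.xml impadmin@slave:/home/impadmin/hadoop-2.6.4/etc/hadoop/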
- Add slaves
Update HADOOP_PREFIX/etc/hadoop/slaves on the master machine to add the slave entries.
Open “slaves” and enter the hostnames of all the slaves, one per line.
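For the two-node cluster assumed in this demo, where only 10.10.10.2 runs a DataNode, the slaves file would contain a single line (using the "slave" hostname from the SSH step):
slave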
- Format namenode
This is a one-time activity. On the master, execute the following command from HADOOP_PREFIX.
$ bin/hadoop namenode -format
or
$ bin/hdfs namenode -format
Once you have data on HDFS, DO NOT run this command again; doing so will result in loss of content.
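If you script the setup, one defensive pattern (a sketch, assuming the default name-directory location under hadoop.tmp.dir) is to format only when no NameNode metadata exists yet:
$ [ -d $HADOOP_PREFIX/tmp/dfs/name/current ] || bin/hdfs namenode -format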
- Run hadoop daemons
From the master, execute the commands below.
Start DFS daemons from HADOOP_PREFIX:
$ sbin/start-dfs.sh
$ jps
Processes which should be running on the master after starting DFS:
NameNode
SecondaryNameNode
Jps
Check on the slave whether the DFS daemons have started:
$ jps
Processes running on the slave:
DataNode
Jps
Start YARN daemons:
From HADOOP_PREFIX execute
$ sbin/start-yarn.sh
$ jps
Processes running on the master:
NameNode
SecondaryNameNode
ResourceManager
Jps
Check on the slave whether the YARN daemons have started:
$ jps
Processes running on the slave:
DataNode
NodeManager
Jps
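When you need to shut the cluster down later, the matching stop scripts live next to the start scripts; run them from HADOOP_PREFIX on the master:
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh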
- Run sample and validate
Let’s run the wordcount sample to validate the setup. First, create an input directory and file:
$ mkdir input
$ cat > input/file
This is a sample file.
This is a sample line.
(press Ctrl+D to finish writing the file)
Add this directory to HDFS:
$ bin/hdfs dfs -copyFromLocal input /input
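To verify the upload before running the job:
$ bin/hdfs dfs -ls /input    # should list the file we created above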
Run example:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /input /output
To check the output, execute the command below:
$ bin/hdfs dfs -cat /output/*
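For the sample file above, the output should look roughly like the following (tab-separated). Note that wordcount splits on whitespace, so trailing punctuation stays attached, and keys come out in byte order (capitals before lowercase):
This	2
a	2
file.	1
is	2
line.	1
sample	2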
- Web interface
We can browse HDFS and check its health using http://masterHostname:50070 in the browser. We can also check the status of running applications through the ResourceManager web UI at http://masterHostname:8088 (the default web UI port; port 9000 is the HDFS RPC port configured in core-site.xml, not a web interface).
Done !!