Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Prerequisites:
1. Java JDK (this demo uses JDK version 1.7.0_67)
Make sure the JAVA_HOME environment variable points to the JDK installation, and that the Java executable’s directory, $JAVA_HOME/bin, is on the PATH.
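For example, on a Linux machine with the JDK installed under /usr/lib/jvm (the exact path is an assumption; adjust it to your installation), the variables can be set and verified as follows:
$ export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
$ export PATH=$PATH:$JAVA_HOME/bin
$ java -version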
2. SSH configured
Make sure that the machines in the Hadoop cluster can SSH to each other without a password. In a single-node setup, the machine should be able to ssh to localhost.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
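To verify that key-based login works (assuming the default key location was used above), try logging in; if no password prompt appears, the setup is correct, and you can type exit to leave the session:
$ ssh localhost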
Installation Steps:
3. Download hadoop-2.6.4.tar.gz from http://hadoop.apache.org/releases.html and extract it to a path on your machine. This demo assumes that “impadmin” is the dedicated Hadoop user.
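For example, assuming the archive is fetched from the Apache archive (the mirror URL below is illustrative; pick any mirror from the releases page) and extracted into the impadmin home directory:
$ wget http://archive.apache.org/dist/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
$ tar -xzf hadoop-2.6.4.tar.gz -C /home/impadmin/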
4. Set up environment variables
Export the environment variables mentioned below.
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
HADOOP_COMMON_HOME=$HADOOP_PREFIX
HADOOP_HDFS_HOME=$HADOOP_PREFIX
YARN_HOME=$HADOOP_PREFIX
HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
For this demo we have modified “hadoop-env.sh” to export these variables.
You can also use ~/.bashrc, /etc/bash.bashrc, or another startup script to
export them.
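After exporting the variables (and sourcing the startup script, if you used one), a quick sanity check is to confirm that the Hadoop binary resolves and reports its version:
$ echo $HADOOP_PREFIX
$ hadoop version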
5. Create HDFS directories
Create two directories to be used by the namenode and the datanode.
Go to <HADOOP_PREFIX> and run:
$ mkdir -p hdfs/namenode
$ mkdir -p hdfs/datanode
List the folders:
$ ls -r hdfs
You will see:
namenode  datanode
6. Tweak config files
Go to the etc/hadoop folder under HADOOP_PREFIX and add the following
properties under the <configuration> tag in the files mentioned below:
etc/hadoop/yarn-site.xml:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
etc/hadoop/core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
etc/hadoop/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:<HADOOP_PREFIX>/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:<HADOOP_PREFIX>/hdfs/datanode</value>
</property>
etc/hadoop/mapred-site.xml:
If this file does not exist, create it and paste the content provided below:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
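Hadoop 2.x tarballs usually ship a template for this file, so instead of creating it from scratch you can copy the template and then add the property above (the template file name is assumed from the stock distribution):
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml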
7. Format namenode
This is a one-time activity.
$ bin/hadoop namenode -format
or
$ bin/hdfs namenode -format
Once you have your data on HDFS, DO NOT run this command again; doing so will
result in loss of content.
8. Run Hadoop daemons
Start DFS daemons:
From <HADOOP_PREFIX> execute:
$ sbin/start-dfs.sh
$ jps
You will see the following processes running at this point:
18831 SecondaryNameNode
18983 Jps
18343 NameNode
18563 DataNode
Start YARN daemons:
From <HADOOP_PREFIX> execute:
$ sbin/start-yarn.sh
$ jps
You will see the following processes at this point:
18831 SecondaryNameNode
18983 Jps
18343 NameNode
18563 DataNode
19312 NodeManager
19091 ResourceManager
Note: you can also use start-all.sh and stop-all.sh for starting/stopping the daemons.
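To stop the HDFS and YARN daemons individually, the matching stop scripts in sbin can be used:
$ sbin/stop-dfs.sh
$ sbin/stop-yarn.sh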
Start Job History Server:
From <HADOOP_PREFIX> execute:
$ sbin/mr-jobhistory-daemon.sh start historyserver
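Running jps again should now additionally list a JobHistoryServer process (the PID below is only illustrative):
$ jps
...
19548 JobHistoryServer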
9. Run sample and validate
Let’s run the wordcount sample to validate the setup.
Make an input file/directory.
$ mkdir input
$ cat > input/file
This is a sample file. This is a sample line.
(Press Ctrl+D to end the input.)
Add this directory to HDFS:
$ bin/hdfs dfs -copyFromLocal input /input
Run example:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /input /output
To check the output, execute the below command:
$ bin/hdfs dfs -cat /output/*
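For the sample input above, the output should look roughly like this (wordcount splits on whitespace, so punctuation stays attached to words):
This    2
a       2
file.   1
is      2
line.   1
sample  2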
10. Web interface
We can browse HDFS and check its health using http://localhost:50070 in the
browser.
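The same basic health information can also be checked from the command line while the daemons are running:
$ bin/hdfs dfsadmin -report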
Installation Completed 🙂