
Hadoop-Yarn Installation in Pseudo-distributed mode

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Prerequisites :

 1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the Java executable’s directory, $JAVA_HOME/bin, is on the PATH.

 2. SSH configured

Make sure that the machines in the Hadoop cluster can SSH to each other without a password. For a single-node setup, the machine should be able to SSH to localhost.

 $ ssh-keygen -t rsa 
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
 $ chmod 0600 ~/.ssh/authorized_keys 
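
To confirm that passwordless SSH is working for a single-node setup, try connecting to localhost; it should log you in without prompting for a password:

 $ ssh localhost
 $ exit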

Installation Steps:

3. Download hadoop-2.6.4.tar.gz from http://hadoop.apache.org/releases.html and extract it to a path on your machine. This guide assumes that “impadmin” is the dedicated Hadoop user and that the archive is extracted to /home/impadmin/hadoop-2.6.4.

4. Set up environment variables
Export the environment variables listed below.

 export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
 export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
 export PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
 export HADOOP_COMMON_HOME=$HADOOP_PREFIX
 export HADOOP_HDFS_HOME=$HADOOP_PREFIX
 export YARN_HOME=$HADOOP_PREFIX
 export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
 export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

For this demo we modified “hadoop-env.sh” to export these variables.
You can also use ~/.bashrc, /etc/bash.bashrc, or another startup script to export
them.
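
As a quick sanity check (assuming the variables are exported in your current shell, e.g. via ~/.bashrc), verify that the paths resolve and that the hadoop command is found:

 $ echo $JAVA_HOME
 $ echo $HADOOP_PREFIX
 $ hadoop version

hadoop version should report 2.6.4 if everything is wired up correctly.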

5. Create HDFS directories
Create two directories to be used by the namenode and the datanode.

Go to <HADOOP_PREFIX>,

 mkdir -p hdfs/namenode
 mkdir -p hdfs/datanode 
List the folders:
 ls -r hdfs
You will see:
 namenode datanode

6. Tweak config files
Go to the etc/hadoop folder under HADOOP_PREFIX and add the following
properties inside the <configuration> tag of the files mentioned below:

etc/hadoop/yarn-site.xml:

 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>
 <property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>

etc/hadoop/core-site.xml:

 <property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
 </property>

etc/hadoop/hdfs-site.xml:

 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:<HADOOP_PREFIX>/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:<HADOOP_PREFIX>/hdfs/datanode</value>
 </property>

(Replace <HADOOP_PREFIX> with the actual path, e.g. file:/home/impadmin/hadoop-2.6.4/hdfs/namenode.)

etc/hadoop/mapred-site.xml:
If this file does not exist, create it (you can copy etc/hadoop/mapred-site.xml.template) and paste the content provided below:

 <?xml version="1.0"?>
 <configuration>
   <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
   </property>
 </configuration>
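
Once the configuration files are saved, a quick way to confirm that Hadoop picks them up is to query a key back (run from <HADOOP_PREFIX>):

 $ bin/hdfs getconf -confKey fs.defaultFS
 hdfs://localhost:9000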

7. Format the namenode
This is a one-time activity.

$ bin/hadoop namenode -format 
 or 
 $ bin/hdfs namenode -format 

Once you have your data on HDFS, DO NOT run this command again; doing so will
result in loss of content.

8. Run Hadoop daemons

Start DFS daemons:
From <HADOOP-PREFIX> execute

 $ sbin/start-dfs.sh
 $ jps

You will see the following processes running at this point:

 18831 SecondaryNameNode
 18983 Jps
 18343 NameNode
 18563 DataNode

Start YARN daemons:
From HADOOP_PREFIX execute

 $ sbin/start-yarn.sh
 $ jps

You will see the following processes at this point:

 18831 SecondaryNameNode
 18983 Jps
 18343 NameNode
 18563 DataNode
 19312 NodeManager
 19091 ResourceManager

Note: you can also use sbin/start-all.sh and sbin/stop-all.sh (deprecated in Hadoop 2.x, but still available) to start or stop all the daemons at once.

Start Job History Server:
From HADOOP_PREFIX execute

 $ sbin/mr-jobhistory-daemon.sh start historyserver
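
Optionally, you can confirm from the command line that HDFS and YARN are up (run from <HADOOP_PREFIX>): hdfs dfsadmin -report prints datanode capacity and usage, and yarn node -list shows the registered NodeManager.

 $ bin/hdfs dfsadmin -report
 $ bin/yarn node -list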

9. Run a sample and validate
Let’s run the wordcount sample to validate the setup.
Create an input directory with a sample file (press Ctrl+D to end the cat input):

 $ mkdir input
 $ cat > input/file
 This is a sample file.
 This is a sample line.

Add this directory to HDFS:

$ bin/hdfs dfs -copyFromLocal input /input

Run example:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /input /output

To check the output, execute the command below:

$ bin/hdfs dfs -cat /output/*
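
For the two-line sample file above, the output should look roughly like this (each line is a word followed by its count):

 This 2
 a 2
 file. 1
 is 2
 line. 1
 sample 2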

10. Web interface
We can browse HDFS and check its health using http://localhost:50070 in the
browser. The YARN ResourceManager web UI is available at http://localhost:8088.

 

Installation Completed 🙂


Installing Apache Spark in local mode on Windows 8

In this post I will walk through the process of downloading and running Apache Spark on Windows 8 x64 in local mode on a single computer.

Prerequisites

  1. Java Development Kit (JDK 7 or 8) (I installed it at ‘C:\Program Files\Java\jdk1.7.0_67’).
  2. Scala 2.11.7 (I installed it at ‘C:\Program Files (x86)\scala’; this is optional).
  3. After installation, we need to set the following environment variables (see the example after this list):
    1. JAVA_HOME, the value is the JDK path.
      In my case it is ‘C:\Program Files\Java\jdk1.7.0_67’.
      Then append ‘%JAVA_HOME%\bin’ to the PATH environment variable.
    2. SCALA_HOME, the value is the Scala path.
      In my case it is ‘C:\Program Files (x86)\scala’.
      Then append ‘%SCALA_HOME%\bin’ to the PATH environment variable.
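
One way to set these variables is from the Command Prompt with setx (a sketch assuming the install paths above; setx persists the values, but they only become visible in newly opened Command Prompt windows):

 setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0_67"
 setx SCALA_HOME "C:\Program Files (x86)\scala"

Append %JAVA_HOME%\bin and %SCALA_HOME%\bin to PATH through the Environment Variables dialog (or with setx PATH, keeping in mind that setx truncates values longer than 1024 characters).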

Downloading and installing Spark

  1. Follow the instructions on http://spark.apache.org/docs/latest/ and download Spark 1.6.0 (Jan 04 2016) with the “Pre-built for Hadoop 2.6 and later” package type from http://spark.apache.org/downloads.html


2. Extract the downloaded archive to D:\Spark.

3. Spark ships with two interactive shells, located in the D:\Spark\bin\ directory:

       a. Scala shell (D:\Spark\bin\spark-shell.cmd).
       b. Python shell (D:\Spark\bin\pyspark.cmd).

4. If you run either of them now, you will see the following exception:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

This issue is caused by a missing winutils.exe file that Spark needs in order to initialize the Hive context, which in turn depends on Hadoop, which requires native libraries on Windows to work properly. Unfortunately, this happens even if you are using Spark in local mode without using any of the HDFS features directly.


To resolve this problem, you need to:

a. download the 64-bit winutils.exe (106KB)

b. copy the downloaded file winutils.exe into a folder like D:\hadoop\bin (or D:\spark\hadoop\bin)

c. set the environment variable HADOOP_HOME to point to the above directory but without \bin (a consolidated example follows after step 5). For example:

  • if you copied the winutils.exe to D:\hadoop\bin, set HADOOP_HOME=D:\hadoop
  • if you copied the winutils.exe to D:\spark\hadoop\bin, set HADOOP_HOME=D:\spark\hadoop

d. Double-check that the environment variable HADOOP_HOME is set properly by opening the Command Prompt and running echo %HADOOP_HOME%

e. You will also notice that when starting spark-shell.cmd, Hive will create a C:\tmp\hive folder. If you receive any errors related to the permissions of this folder, use the following commands to fix the permissions on that folder:

  • List current permissions: %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
  • Set permissions: %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
  • List updated permissions: %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive

5. Re-run spark-shell; it should now work as expected.
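
Putting the winutils fix together, a typical Command Prompt session might look like this (a sketch assuming winutils.exe was copied to D:\hadoop\bin; run the first command, then open a new Command Prompt for the rest, since setx does not affect the current session):

 setx HADOOP_HOME D:\hadoop
 echo %HADOOP_HOME%
 %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
 D:\Spark\bin\spark-shell.cmd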

Text search sample

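A minimal text search in the Scala shell might look like the following (a sketch assuming the D:\Spark install directory from step 2 and the README.md file bundled with Spark as input):

 D:\Spark> bin\spark-shell.cmd
 scala> val lines = sc.textFile("D:/Spark/README.md")
 scala> val hits = lines.filter(line => line.contains("Spark"))
 scala> hits.count()
 scala> hits.take(5).foreach(println)

Here sc is the SparkContext that spark-shell creates automatically; count() returns the number of lines containing the word “Spark”, and take(5) prints the first few matching lines.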

Hope that helps!