Month: April 2016

Installing Phoenix – A step by step tutorial

Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Prerequisites –

1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the java executable’s directory, i.e. $JAVA_HOME/bin, is in the PATH environment variable.

2. Make sure you have installed HBase on your machine. For that, refer to my post Hbase Installation in Pseudo-Distributed mode.

Install Phoenix –

1. Download phoenix-4.7.0-HBase-1.1 and expand the installation tarball.

tar -zxvf phoenix-4.7.0-HBase-1.1-bin.tar.gz

2. Add the phoenix-[version]-server.jar to the classpath of HBase region server and master and remove any previous version. An easy way to do this is to copy it into the HBASE_INSTALL_DIR/lib directory.

3. Add the phoenix-[version]-client.jar to the classpath of your Phoenix client.

4. Restart HBase.

5. Run an example to verify that everything is working:

Open the Command Line – a terminal interface for executing SQL from the command line is bundled with Phoenix. To start it, execute the following from the bin directory:

$ sqlline.py localhost

If ZooKeeper is running externally, point sqlline at it instead. In our case ZooKeeper runs on the master node, so run:

   $ sqlline.py <master-hostname>:2181

a. First, let’s create a us_population.sql file, containing a table definition:

CREATE TABLE IF NOT EXISTS us_population (
 state CHAR(2) NOT NULL,
 city VARCHAR NOT NULL,
 population BIGINT
 CONSTRAINT my_pk PRIMARY KEY (state, city));

b. Now let’s create a us_population.csv file containing some data to put in that table:

 NY,New York,8143197
 CA,Los Angeles,3844829
 IL,Chicago,2842518
 TX,Houston,2016582
 PA,Philadelphia,1463281
 AZ,Phoenix,1461575
 TX,San Antonio,1256509
 CA,San Diego,1255540
 TX,Dallas,1213825
 CA,San Jose,912332

c. And finally, let’s create a us_population_queries.sql file containing a query we’d like to run on that data.

SELECT state as "State",count(city) as "City Count",sum(population) as "Population Sum"
 FROM us_population
 GROUP BY state
 ORDER BY sum(population) DESC;

d. Loading data: in addition, you can use bin/psql.py to load CSV data or execute SQL scripts. For example:

./psql.py <your_zookeeper_quorum> us_population.sql us_population.csv us_population_queries.sql

You have now created a table in Phoenix, inserted the data, and run the query.
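Because Phoenix is exposed through a standard JDBC driver, the same table can also be queried from Java. Below is a minimal sketch, not part of the original steps: it assumes the phoenix-4.7.0-HBase-1.1-client.jar is on the classpath and that ZooKeeper is reachable at localhost:2181 (adjust the connection string to your quorum); the class name PhoenixQueryExample is purely illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // Optional on JDBC 4+ drivers; shown here for clarity.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

        // URL format: jdbc:phoenix:<zookeeper quorum>:<port>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT state, COUNT(city), SUM(population) "
                   + "FROM us_population GROUP BY state ORDER BY SUM(population) DESC")) {
            while (rs.next()) {
                // Print state, city count, and population sum for each group.
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2) + "\t" + rs.getLong(3));
            }
        }
    }
}

Compile and run it with the Phoenix client jar on the classpath; it should print one line per state, ordered by total population, just like the sqlline query above.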

 

This is it 🙂


Hbase Installation in Pseudo-Distributed mode

HBase is a distributed, column-oriented database with a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. This tutorial provides an introduction to HBase, the procedure to set up HBase on the Hadoop Distributed File System, and ways to interact with the HBase shell. It also describes how to connect to HBase using Java and how to perform basic operations on HBase using Java.

Prerequisites :

1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the java executable’s directory, i.e. $JAVA_HOME/bin, is in the PATH environment variable.

2. SSH configured

Make sure that the machines in the Hadoop cluster can do password-less SSH to each other. In a single-node setup, the machine should be able to ssh to localhost.

 $ ssh-keygen -t rsa 
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
 $ chmod 0600 ~/.ssh/authorized_keys

3. Before we start configuring HBase, you need a running Hadoop installation, which will be the storage layer for HBase (HBase stores its data in the Hadoop Distributed File System). Please refer to the Hadoop-Yarn Installation in Pseudo-distributed mode post before continuing.

Installing And Configuring Hbase

1. Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/ using the “wget” command, and extract it using tar “zxvf”. See the following commands.

$ wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-1.1.4-bin.tar.gz
$ tar -zxvf hbase-1.1.4-bin.tar.gz

 

2. Go to <HBASE_HOME>/conf/hbase-env.sh

Export the JAVA_HOME environment variable in the hbase-env.sh file as shown below:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67

Go to <HBASE_HOME>/conf/hbase-site.xml

 

Inside the hbase-site.xml file, you will find the <configuration> and </configuration> tags. Within them, set the HBase directory under the property key with the name “hbase.rootdir” as shown below.

<configuration>
   <!-- Here you have to set the path where you want HBase to store its files. -->
   <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
   </property>

   <!-- Here you have to set the path where you want HBase to store its built-in ZooKeeper files. -->
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hadoop/zookeeper</value>
   </property>

   <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
   </property>
</configuration>

3. Starting HBase

After the configuration is done, browse to the HBase home folder and start HBase using the following command.

$ bin/start-hbase.sh

4. Checking the HBase Directory in HDFS
HBase creates its directory in HDFS. To see the created directory, browse to the Hadoop home directory and type the following command.

 $ ./bin/hadoop fs -ls /hbase

If everything goes well, it will give you the following output.
Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs

5. Run a sample example.

Go to <HBASE_HOME> and run the command:

$ bin/hbase shell

Create a table

Use the create command to create a new table. We must specify the table name and the ColumnFamily name:

hbase(main):001:0> create 'test', 'cf'
0 row(s) in 3.3340 seconds

=> Hbase::Table - test

Populating the data

Here, we insert three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in HBase are composed of a column family prefix, cf in this example, followed by a colon and then a column qualifier suffix, a in the case below:

hbase(main):008:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 1.3280 seconds

hbase(main):009:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0340 seconds

hbase(main):010:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0150 seconds

Scanning the table for all data at once

We can get data from HBase using scan. We can limit our scan, but for now, all data is fetched:

hbase(main):011:0> scan 'test'
ROW                               COLUMN+CELL                                                                                     
 row1                             column=cf:a, timestamp=1427820136323, value=value1                                              
 row2                             column=cf:b, timestamp=1427820144111, value=value2                                              
 row3                             column=cf:c, timestamp=1427820153067, value=value3                                              
3 row(s) in 0.1650 seconds

Get a single row of data –

To get a single row of data at a time, we can use the get command.

hbase(main):012:0> get 'test', 'row1'
COLUMN                            CELL                                                                                            
 cf:a                             timestamp=1427820136323, value=value1                                                           
1 row(s) in 0.0650 seconds
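The introduction mentioned connecting to HBase from Java; the shell session above can be reproduced with the HBase 1.x client API. This is a minimal sketch, assuming hbase-client and its dependencies are on the classpath and ZooKeeper runs on localhost; the class name HBaseClientExample is illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Create the 'test' table with column family 'cf' if it does not exist yet.
            TableName tableName = TableName.valueOf("test");
            if (!admin.tableExists(tableName)) {
                HTableDescriptor descriptor = new HTableDescriptor(tableName);
                descriptor.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(descriptor);
            }

            try (Table table = connection.getTable(tableName)) {
                // Put: row1, column cf:a, value 'value1', same as the shell example above.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
                table.put(put);

                // Get: read the single row back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
                System.out.println("row1 cf:a = " + Bytes.toString(value));
            }
        }
    }
}

Run it after HBase is up; it creates the test table if needed, writes row1/cf:a, and reads it back, mirroring the put and get shell commands above.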

 

🙂

Hadoop-Yarn Installation in Pseudo-distributed mode

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Prerequisites :

 1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the java executable’s directory, i.e. $JAVA_HOME/bin, is in the PATH environment variable.

 2. SSH configured

Make sure that the machines in the Hadoop cluster can do password-less SSH to each other. In a single-node setup, the machine should be able to ssh to localhost.

 $ ssh-keygen -t rsa 
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
 $ chmod 0600 ~/.ssh/authorized_keys 

Installation Steps:

3. Download hadoop-2.6.4.tar.gz from http://hadoop.apache.org/releases.html and extract it to a path on your machine. We assume that “impadmin” is the dedicated user for Hadoop.

4. Set up environment variables
Export the environment variables mentioned below.

 JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
 HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
 PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
 HADOOP_COMMON_HOME=$HADOOP_PREFIX
 HADOOP_HDFS_HOME=$HADOOP_PREFIX
 YARN_HOME=$HADOOP_PREFIX
 HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
 YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

For this demo we have modified “hadoop-env.sh” to export the variables.
You can also use ~/.bashrc, /etc/bash.bashrc, or another startup script to export
these variables.

5. Create HDFS directories
Create two directories to be used by namenode and datanode.

Go to <HADOOP_PREFIX>,

 mkdir -p hdfs/namenode
 mkdir -p hdfs/datanode

List the folders:

 ls -r hdfs

You will see:

 namenode datanode

6. Tweak config files
Go to the etc/hadoop folder under HADOOP_PREFIX and add the following properties under the <configuration> tag in the files mentioned below:

etc/hadoop/yarn-site.xml:

 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>
 <property>
   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>

etc/hadoop/core-site.xml:

 <property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
 </property>

etc/hadoop/hdfs-site.xml:

 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:<HADOOP_PREFIX>/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:<HADOOP_PREFIX>/hdfs/datanode</value>
 </property>

etc/hadoop/mapred-site.xml:
If this file does not exist, create it and paste the content provided below:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

7. Format namenode
This is a one-time activity.

$ bin/hadoop namenode -format 
 or 
 $ bin/hdfs namenode -format 

Once you have your data on HDFS, DO NOT run this command again; doing so will
result in loss of content.

8. Run Hadoop daemons

Start DFS daemons:
From <HADOOP_PREFIX> execute

 $ sbin/start-dfs.sh
 $ jps

You will see the following processes running at this point –

 18831 SecondaryNameNode
 18983 Jps
 18343 NameNode
 18563 DataNode

Start YARN daemons:
From HADOOP_PREFIX execute

 $ sbin/start-yarn.sh
 $ jps

You will see the following processes at this point –

 18831 SecondaryNameNode
 18983 Jps
 18343 NameNode
 18563 DataNode
 19312 NodeManager
 19091 ResourceManager

Note: you can also use start-all.sh and stop-all.sh for starting/stopping the daemons.

Start Job History Server:
From HADOOP_PREFIX execute

sbin/mr-jobhistory-daemon.sh start historyserver

9. Run sample and validate
Let’s run the wordcount sample to validate the setup.
Make an input file/directory.

$ mkdir input 
 $ cat > input/file 
 This is a sample file. 
 This is a sample line. 

Add this directory to HDFS:

$ bin/hdfs dfs -copyFromLocal input /input

Run example:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /input /output

To check the output, execute the command below:

$ bin/hdfs dfs -cat /output/*
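If you prefer to read the job output from Java rather than the command line, the HDFS FileSystem API can do the same thing as the -cat command above. A small sketch, assuming the Hadoop client libraries are on the classpath and the NameNode is at hdfs://localhost:9000 as configured earlier; the class name ReadWordCountOutput is illustrative only.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWordCountOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Each reducer writes a part-r-xxxxx file under /output; print every file found there.
            for (FileStatus status : fs.listStatus(new Path("/output"))) {
                if (!status.isFile()) {
                    continue;
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(status.getPath())))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }
}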

10. Web interface
We can browse HDFS and check its health using http://localhost:50070 in the
browser.

 

Installation Completed 🙂