Most frequently used Hadoop commands

Commands useful for users of a Hadoop cluster.

1. appendToFile

Usage: hdfs dfs -appendToFile <localsrc> … <dst>

Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.

hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile
hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)

Exit Code:

Returns 0 on success and 1 on error.
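The exit status lands in `$?` like any other shell command, so scripts can branch on it. A minimal local sketch, using `true` and `false` as stand-ins for a succeeding and a failing hdfs dfs invocation:

```shell
true                 # stand-in for a successful hdfs dfs command
echo "exit=$?"       # prints exit=0
false                # stand-in for a failing hdfs dfs command
echo "exit=$?"       # prints exit=1
```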

2. cat

Usage: hdfs dfs -cat URI [URI …]

Copies source paths to stdout.

Example:

hdfs dfs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hdfs dfs -cat file:///file3 /user/hadoop/file4

Exit Code:

Returns 0 on success and -1 on error.

3. chmod

Usage: hdfs dfs -chmod [-R] <MODE[,MODE]… | OCTALMODE> URI [URI …]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.

Options

The -R option will make the change recursively through the directory structure.

4. chown

Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide.

Options

The -R option will make the change recursively through the directory structure.

5. copyFromLocal

Usage: hdfs dfs -copyFromLocal <localsrc> URI

Similar to put command, except that the source is restricted to a local file reference.

Options:

The -f option will overwrite the destination if it already exists.

6. copyToLocal

Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to get command, except that the destination is restricted to a local file reference.

7. count

Usage: hdfs dfs -count [-q] [-h] <paths>

Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME

The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME

The -h option shows sizes in human readable format.

Example:

hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hdfs dfs -count -q hdfs://nn1.example.com/file1
hdfs dfs -count -q -h hdfs://nn1.example.com/file1

Exit Code:

Returns 0 on success and -1 on error.

8. cp

Usage: hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI …] <dest>

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.

‘raw.*’ namespace extended attributes are preserved if (1) the source and destination filesystems support them (HDFS only), and (2) all source and destination pathnames are in the /.reserved/raw hierarchy. Determination of whether raw.* namespace xattrs are preserved is independent of the -p (preserve) flag.

Options:

The -f option will overwrite the destination if it already exists.
The -p option will preserve file attributes [topx] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, then preserves timestamps, ownership, permission. If -pa is specified, then preserves permission also because ACL is a super-set of permission. Determination of whether raw namespace extended attributes are preserved is independent of the -p flag.

Example:

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

9. du

Usage: hdfs dfs -du [-s] [-h] URI [URI …]

Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.

Options:

The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
The -h option will format file sizes in a "human-readable" fashion (e.g., 64.0m instead of 67108864)
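The example value above checks out with plain shell arithmetic: 64.0m is 64 × 1024 × 1024 bytes.

```shell
# 64 MiB expressed in bytes, matching the -h example above
echo $((64 * 1024 * 1024))
```

which prints 67108864.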

Example:

hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

Exit Code: Returns 0 on success and -1 on error.

10. get

Usage: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.

Example:

hdfs dfs -get /user/hadoop/file localfile
hdfs dfs -get hdfs://nn.example.com/user/hadoop/file localfile

Exit Code:

Returns 0 on success and -1 on error.

11. ls

Usage: hdfs dfs -ls [-R] <args>

Options:

The -R option will return stat recursively through the directory structure.

For a file returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Example:

hdfs dfs -ls /user/hadoop/file1

Exit Code:

Returns 0 on success and -1 on error.

12. lsr

Usage: hdfs dfs -lsr <args>

Recursive version of ls.

Note: This command is deprecated. Instead use hdfs dfs -ls -R

13. mkdir

Usage: hdfs dfs -mkdir [-p] <paths>

Takes path uri’s as argument and creates directories.

Options:

The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

14. moveFromLocal

Usage: hdfs dfs -moveFromLocal <localsrc> <dst>

Similar to put command, except that the source localsrc is deleted after it's copied.

15. moveToLocal

Usage: hdfs dfs -moveToLocal [-crc] <src> <dst>

Displays a "Not implemented yet" message.

16. mv

Usage: hdfs dfs -mv URI [URI …] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
hdfs dfs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

Exit Code:

Returns 0 on success and -1 on error.

17. put

Usage: hdfs dfs -put <localsrc> … <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.

hdfs dfs -put localfile /user/hadoop/hadoopfile
hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)

Exit Code:

Returns 0 on success and -1 on error.

18. rm

Usage: hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI …]

Delete files specified as args.

Options:

The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
The -R option deletes the directory and any content under it recursively.
The -r option is equivalent to -R.
The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.

Example:

hdfs dfs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

Exit Code:

Returns 0 on success and -1 on error.

19. rmr

Usage: hdfs dfs -rmr [-skipTrash] URI [URI …]

Recursive version of delete.

Note: This command is deprecated. Instead use hdfs dfs -rm -r

20. text

Usage: hdfs dfs -text <src>

Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

21. touchz

Usage: hdfs dfs -touchz URI [URI …]

Create a file of zero length.

Example:

hdfs dfs -touchz pathname

Exit Code: Returns 0 on success and -1 on error.
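Locally, `touch` (or `mktemp`) plays the same role as -touchz; a quick sanity check that a freshly created file really is zero length (the temp file here is just a local stand-in, not an HDFS path):

```shell
f=$(mktemp)          # create an empty temp file, analogous to -touchz
wc -c < "$f"         # prints 0: the file is zero length
rm -f "$f"
```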

 

Thanks 🙂

 


SolrCloud Setup on single machine

SolrCloud is the name of a set of distributed capabilities in Solr. Enabling these capabilities lets you set up a highly available, fault-tolerant cluster of Solr servers. Use SolrCloud when you want high-scale, fault-tolerant, distributed indexing and search capabilities.

A little about SolrCores and Collections

On a single instance, Solr has something called a SolrCore that is essentially a single index. If you want multiple indexes, you create multiple SolrCores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made up of multiple SolrCores on different machines. We call all of these SolrCores that make up one logical index a collection. A collection is essentially a single index that spans many SolrCores, both for index scaling as well as redundancy. If you wanted to move your 2-SolrCore Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual SolrCores.

Steps to install SolrCloud:

  1. Download solr-4.10.0 from http://lucene.apache.org/solr/downloads.html and unzip it.
  2. Create a cluster of two Solr servers representing two different shards of a collection. Since we'll need two Solr servers, simply make a copy of the unzipped Solr folder for the second server, making sure you don't have any data already indexed. In a command prompt, go to the parent folder and then:

cp -r solr-4.10.0 solr2

  3. Go to the example folder of the first server (solr1) in a command prompt:

cd example

  4. Enter the command that starts up a Solr server and bootstraps a new Solr cluster:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

-Dbootstrap_confdir : path to the configuration directory; set it according to the location of the conf folder.
-Dcollection.configName : name of the conf folder on ZooKeeper.
-DnumShards : number of shards.

  5. Browse to http://localhost:8983/solr/#/~cloud to see the state of the cluster.
  6. Then start the second server, pointing it at the cluster. Go to the example folder of the solr2 server:

cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

  7. You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI: http://localhost:8983/solr/#/~cloud
  8. To check the health of the cluster:

solr healthcheck -c <collection name>

 

Spring Batch Easy Example – from csv to csv file

Batch processing is the execution of a series of programs ("jobs") on a computer without manual intervention.

Spring Batch provides mechanisms for processing large amounts of data: transaction management, job processing, resource management, logging, tracing, data conversion, interfaces, etc.
These functionalities are available out of the box and can be reused by applications that include the Spring Batch framework.

In this tutorial, we will show you how to configure a Spring Batch job to read a CSV file, filter out records with an ItemProcessor, and write the result to another CSV file. It's a very easy program for beginners.

Tools and libraries used

  1. Maven 3
  2. Eclipse Luna
  3. JDK 1.7
  4. Spring Core 3.2.2.RELEASE
  5. Spring Batch 2.2.0.RELEASE
  6. Spring OXM 3.2.2.RELEASE

1. Create a Maven project. I named my project SpringBatchProject.

2. Project Dependencies –

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.solution.springbatch</groupId>
  <artifactId>SpringBatchProject</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <url>http://maven.apache.org</url>
  
  <properties>
        <jdk.version>1.7</jdk.version>
        <spring.version>3.2.2.RELEASE</spring.version>
        <spring.batch.version>2.2.0.RELEASE</spring.batch.version>
         <quartz.version>2.2.1</quartz.version>
    </properties>
    
    <dependencies>

        <!-- Spring Core --> 
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${spring.version}</version>
        </dependency>

        <!-- Spring XML to/back object -->
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-oxm</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <!-- Spring Batch dependencies -->
        <dependency>
            <groupId>org.springframework.batch</groupId>
            <artifactId>spring-batch-core</artifactId>
            <version>${spring.batch.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.batch</groupId>
            <artifactId>spring-batch-infrastructure</artifactId>
            <version>${spring.batch.version}</version>
        </dependency>

        <!-- Spring Batch unit test -->
        <dependency>
            <groupId>org.springframework.batch</groupId>
            <artifactId>spring-batch-test</artifactId>
            <version>${spring.batch.version}</version>
        </dependency>
     <dependency>
            <groupId>org.quartz-scheduler</groupId>
            <artifactId>quartz</artifactId>
            <version>${quartz.version}</version>
        </dependency>
        <!-- Junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>

    </dependencies>
    <build>
        <finalName>spring-batch</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <version>2.9</version>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <downloadJavadocs>false</downloadJavadocs>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>${jdk.version}</source>
                    <target>${jdk.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

3. Project Structure –

project_structure

4. CSV file resources/files/input.csv

1001,iryna,31,31/08/1982,200000
1003,john,29,21/08/1984,1000000
1004,brett,29,21/03/1984,80000.89
1002,jane,30,21/04/1992,500000
1005,anee,27,14/06/1992,500000

5. Read CSV file resources/jobs/job-report.xml

<!-- read csv file-->

<bean id="cvsFileItemReader" class="org.springframework.batch.item.file.FlatFileItemReader">
    <property name="resource" value="classpath:files/input.csv" />
    <property name="lineMapper">
        <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
            <!-- split each line on commas -->
            <property name="lineTokenizer">
                <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                    <property name="names" value="refId, name, age, csvDob, income" />
                </bean>
            </property>
            <!-- map with Report bean -->
            <property name="fieldSetMapper">
                <bean class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
                    <property name="prototypeBeanName" value="report" />
                </bean>
            </property>
        </bean>
    </property>
</bean>

6. The CSV file is mapped to the POJO Report.java

package com.solution.model;

import java.math.BigDecimal;
import java.text.SimpleDateFormat;
import java.util.Date;

public class Report {

    private int refId;
    private String name;
    private int age;
    private Date dob;
    private BigDecimal income;
    
    
    public int getRefId() {
        return refId;
    }

    public void setRefId(int refId) {
        this.refId = refId;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    public Date getDob() {
        return dob;
    }

    public void setDob(Date dob) {
        this.dob = dob;
    }

    public BigDecimal getIncome() {
        return income;
    }

    public void setIncome(BigDecimal income) {
        this.income = income;
    }
    
    public void setCsvDob(String csvDob) {
        // Parse the dd/MM/yyyy value from the CSV into the dob field.
        // Without this setter, BeanWrapperFieldSetMapper cannot bind the
        // csvDob column read from the file.
        SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MM/yyyy");
        try {
            setDob(dateFormat.parse(csvDob));
        } catch (java.text.ParseException e) {
            throw new IllegalArgumentException("Invalid date: " + csvDob, e);
        }
    }

    public String getCsvDob() {

        SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MM/yyyy");
        return dateFormat.format(getDob());
      }
}

7. Spring Batch Core Settings

Define jobRepository and jobLauncher

resources/config/context.xml
<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="
	http://www.springframework.org/schema/beans 
	http://www.springframework.org/schema/beans/spring-beans-3.2.xsd">

    <!-- stored job-meta in memory --> 
    <bean id="jobRepository"
	class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
	<property name="transactionManager" ref="transactionManager" />
    </bean>
 	
    <bean id="transactionManager"
	class="org.springframework.batch.support.transaction.ResourcelessTransactionManager" />
	
 
    <bean id="jobLauncher"
	class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
	<property name="jobRepository" ref="jobRepository" />
    </bean>

</beans>

8. Spring Batch Jobs

A Spring Batch job that reads the input.csv file, maps each line to a Report object, filters it with the processor, and writes the result to a CSV file (resources/jobs/job-report.xml):

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:batch="http://www.springframework.org/schema/batch" xmlns:task="http://www.springframework.org/schema/task"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:util="http://www.springframework.org/schema/util" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/batch
        http://www.springframework.org/schema/batch/spring-batch-2.2.xsd
        http://www.springframework.org/schema/beans 
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd
        http://www.springframework.org/schema/util 
        http://www.springframework.org/schema/util/spring-util-3.2.xsd
        http://www.springframework.org/schema/task
        http://www.springframework.org/schema/task/spring-task-3.2.xsd
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context.xsd">
    
    <context:component-scan base-package="com.solution.scheduler" />
    
    <bean id="report" class="com.solution.model.Report" scope="prototype" />
    <batch:job id="reportJob" restartable="true">
        <batch:step id="step1">
            <batch:tasklet>
                <batch:chunk reader="cvsFileItemReader" writer="cvsFileItemWriter" processor="filterReportProcessor"
                    commit-interval="1">
                </batch:chunk>
            </batch:tasklet>
        </batch:step>
    </batch:job>

    <bean id="filterReportProcessor" class="com.solution.processor.FilterReportProcessor" />

    <bean id="cvsFileItemReader" class="org.springframework.batch.item.file.FlatFileItemReader">
        <property name="resource" value="classpath:files/input.csv" />
        <property name="lineMapper">
            <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
                <!-- split each line on commas -->
                <property name="lineTokenizer">
                    <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                        <property name="names" value="refId, name, age, csvDob, income" />
                    </bean>
                </property>
                <!-- map with Report bean -->
                <property name="fieldSetMapper">
                    <bean class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
                        <property name="prototypeBeanName" value="report" />
                    </bean>
                </property>
            </bean>
        </property>
    </bean>


    <bean id="cvsFileItemWriter" class="org.springframework.batch.item.file.FlatFileItemWriter">

        <!-- write to this csv file -->
        <property name="resource" value="file:csv/report.csv" />
        <property name="shouldDeleteIfExists" value="true" />

        <property name="lineAggregator">
            <bean
                class="org.springframework.batch.item.file.transform.DelimitedLineAggregator">
                <property name="delimiter" value="," />
                <property name="fieldExtractor">
                    <bean
                        class="org.springframework.batch.item.file.transform.BeanWrapperFieldExtractor">
                        <property name="names" value="refId, name, age, csvDob, income" />
                    </bean>
                </property>
            </bean>
        </property>

    </bean>
    <bean id="runScheduler" class="com.solution.scheduler.RunScheduler" />
    <!-- Spring's cron format has six fields: second minute hour day-of-month month day-of-week.
         "*/5" in the seconds field runs the job every 5 seconds. -->
    <task:scheduled-tasks>
        <task:scheduled ref="runScheduler" method="run" cron="*/5 * * * * *" />
    </task:scheduled-tasks>
</beans>

9. Spring Batch – ItemProcessor

In Spring Batch, the wired processor is fired before writing to any resources, so this is the best place to handle any conversion, filtering and business logic. In this example, a Report object is ignored (not written to the CSV file) if its age is greater than 30.

package com.solution.processor;

import org.springframework.batch.item.ItemProcessor;

import com.solution.model.Report;


//run before writing
public class FilterReportProcessor implements ItemProcessor<Report, Report> {

    @Override
    public Report process(Report item) throws Exception {

        //filter object which age > 30
        if(item.getAge()>30){
            return null; // null = ignore this object
        }
        return item;
    }

}

10. I have scheduled this process to run every 5 seconds through the cron expression in the job configuration.

RunScheduler.java

package com.solution.scheduler;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class RunScheduler {

  @Autowired
  private JobLauncher jobLauncher;

  @Autowired
  private Job job;

  public void run() {

      try {
          JobParameters jobParameters = 
                  new JobParametersBuilder()
                  .addLong("time",System.currentTimeMillis()).toJobParameters();
            JobExecution execution = jobLauncher.run(job, jobParameters);
            System.out.println("Exit Status : " + execution.getStatus());

        } catch (Exception e) {
            e.printStackTrace();
        }

  }
}

11. Run the Main class now

package com.solution.scheduler;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class App {
    public static void main(String[] args) {

        String[] springConfig  = 
            {    
                "config/context.xml",
                "jobs/job-report.xml" 
            };
        
        // Loading the context registers the scheduled task, which then
        // launches the batch job every 5 seconds.
        ApplicationContext context = 
                new ClassPathXmlApplicationContext(springConfig);
    }
}

12. Output CSV file, i.e. csv/report.csv. Note that record 1001 (iryna, age 31) has been filtered out by the processor.

1003,john,29,08/09/1985,1000000
1004,brett,29,03/09/1985,80000.89
1002,jane,30,04/09/1993,500000
1005,anee,27,06/02/1993,500000
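The processor's filtering rule (drop rows whose age is over 30) can be sanity-checked against the input data with a one-liner; the awk command and file name below are just a local stand-in for illustration, not part of the Spring job:

```shell
# Keep only rows whose 3rd field (age) is <= 30, mirroring FilterReportProcessor.
printf '1001,iryna,31,31/08/1982,200000\n1003,john,29,21/08/1984,1000000\n' > input-sample.csv
awk -F, '$3 <= 30' input-sample.csv
```

Only the row for john (age 29) survives; iryna (age 31) is dropped.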

 

 

 

Installing Hbase in fully distributed mode

Pre-requisite

1. Java JDK (This demo uses JDK version 1.7.0_67)
Make sure the JAVA_HOME system environment variable points to the JDK. Make sure the java executable's directory, i.e. $JAVA_HOME/bin, is in the PATH environment variable.

2. Make sure you have installed Hadoop on your cluster; please refer to my post Installing Hadoop in fully distributed mode to install it.

Installing And Configuring Hbase

Assumptions –
For the purpose of clarity and ease of expression, I’ll be assuming that we are setting up a cluster of 2 nodes with IP Addresses

10.10.10.1 - HMaster
10.10.10.2 - HRegionServer

In my case the HMaster node is also the NameNode, and the region server is a DataNode.

1. Download hbase-1.1.4-bin.tar.gz from http://www.apache.org/dyn/closer.cgi/hbase/ and extract it to some path on your computer. I'll refer to the HBase installation root as $HBASE_INSTALL_DIR.

2. Edit the file /etc/hosts on the master machine and add the following lines.

10.10.10.1 master
10.10.10.2 slave

Note: Run the command "ping master" to check that the master hostname is being resolved to the actual IP, not the localhost IP.

3. Since these machines already have Hadoop installed, passwordless SSH is already set up.

4. Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and set the $JAVA_HOME.

     export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67

5. Configure Hbase

Case I - When HBase manages the ZooKeeper ensemble

Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and set HBASE_MANAGES_ZK to true to indicate that HBase should manage the ZooKeeper ensemble internally.

export HBASE_MANAGES_ZK=true

Open the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties.

<configuration>
  <property>
    <name>hbase.master</name>
    <value><master-hostname>:60000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://<master-hostname>:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>

Case II - When the ZooKeeper ensemble is managed externally

Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and set:

   export HBASE_MANAGES_ZK=false

For this configuration, add two more properties to hbase-site.xml:

<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value><master-hostname></value>
</property>

Note: In our case, ZooKeeper and the HBase master are running on the same machine.

6. Edit the $HBASE_INSTALL_DIR/conf/regionservers file on all the HBase cluster nodes. Add the hostnames of all the region server nodes, e.g.:

10.10.10.2

7. Repeat same procedure for all the masters and region servers.

Start and Stop Hbase cluster

8. Starting the HBase Cluster

 

Before starting the HBase cluster, start ZooKeeper if it is externally managed. Go to <zookeeper_home>/bin:

     ./zkServer.sh start

 

We need to start the daemons only on the hbase-master machine; it will start the daemons on all region server machines.

Execute the following command to start the hbase cluster.

    $HBASE_INSTALL_DIR/bin/start-hbase.sh

Note:

At this point, the following Java processes should run on the hbase-master machine.

xxx@master:$ jps
           14143 Jps
           14007 HQuorumPeer (QuorumPeerMain if ZooKeeper is managed externally)
           14066 HMaster
           9561 SecondaryNameNode
           9133 NameNode
           9783 ResourceManager

 

and the following java processes should run on hbase-regionserver machine.

           23026 HRegionServer
           23171 Jps
           9311 DataNode
           9966 NodeManager

 

9. Starting the HBase shell:

$HBASE_INSTALL_DIR/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.4, r14c0e77956f9bb4c6edf0378474264843e4a82c3, Wed Mar 16 21:18:26 PDT 2016
hbase(main):001:0>
hbase(main):001:0> create 't1','f1'
0 row(s) in 1.2910 seconds
hbase(main):002:0>

 

Note: If the table is created successfully, everything is running fine.

 

10. Stopping the HBase Cluster:

Execute the following command on hbase-master machine to stop the hbase cluster.

     $HBASE_INSTALL_DIR/bin/stop-hbase.sh

 

On top of HBase we can install Apache Phoenix, which is a SQL layer over HBase. For installation of Phoenix you can refer to my post Installing Phoenix – A step by step tutorial.

 

Installing Hadoop in fully distributed mode

Pre-requisite

  1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME system environment variable points to the JDK. Make sure the java executable's directory, i.e. $JAVA_HOME/bin, is in the PATH environment variable.

  2. SSH configured

Make sure that the machines in the Hadoop cluster are able to do passwordless SSH. In a multi-node setup, each machine should be able to SSH without a password from/to all machines of the cluster.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh-copy-id -i ~/.ssh/id_rsa.pub impadmin@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub impadmin@slave

Installing And Configuring Hadoop

Assumptions –

For the purpose of clarity and ease of expression, I’ll be assuming that we are setting up a cluster of 2 nodes with IP Addresses

 10.10.10.1 - Namenode
 10.10.10.2 - Datanode

  1. Download hadoop-2.6.4 and extract the installation tar on all the nodes at the same path. Use a dedicated user for Hadoop (we assume the dedicated user is "impadmin").

Make sure that master and all the slaves have the same user.

  2. Setup environment variables

Export environment variables as mentioned below for all nodes in the cluster.

export  JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export  HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export  PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export  HADOOP_COMMON_HOME=$HADOOP_PREFIX
export  HADOOP_HDFS_HOME=$HADOOP_PREFIX
export  YARN_HOME=$HADOOP_PREFIX
export  HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export  YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

For this demo we have modified "hadoop-env.sh" to export the variables. You can also use ~/.bashrc, /etc/bash.bashrc or another startup script to export these variables.

Add the following lines at the start of the script etc/hadoop/yarn-env.sh:

export  JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export  HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export  PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
export  HADOOP_COMMON_HOME=$HADOOP_PREFIX
export  HADOOP_HDFS_HOME=$HADOOP_PREFIX
export  YARN_HOME=$HADOOP_PREFIX
export  HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export  YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

3. Create a folder for hadoop.tmp.dir
Create a temp folder in HADOOP_PREFIX

mkdir -p $HADOOP_PREFIX/tmp

4. Tweak config files
For all the machines in the cluster, go to the etc/hadoop folder under HADOOP_PREFIX and add the following properties under the configuration tag in the files mentioned below.

etc/hadoop/core-site.xml –

<property>
  <name>fs.default.name</name>
  <value>hdfs://Master-Hostname:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/impadmin/hadoop-2.6.4/tmp</value>
</property>

etc/hadoop/hdfs-site.xml :

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

etc/hadoop/mapred-site.xml :

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

etc/hadoop/yarn-site.xml :

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>Master-Hostname:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>Master-Hostname:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>Master-Hostname:8040</value>
</property>

Note: Make sure to replace "Master-Hostname" with your cluster's master hostname.

  5. Add slaves

Update HADOOP_PREFIX/etc/hadoop/slaves on the master machine to add the slave entries.

Open "slaves" and enter the hostnames of all the slaves, one per line.
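For the two-node cluster assumed above, the slaves file would contain a single entry, the datanode's hostname (the hostname below is an example; substitute your own). Contents of HADOOP_PREFIX/etc/hadoop/slaves:

```
slave1
```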

  6. Format namenode

This is a one-time activity. On the master, execute the following command from HADOOP_PREFIX:

         $ bin/hadoop namenode -format
or
         $ bin/hdfs namenode -format

Once you have data on HDFS, DO NOT run this command again; doing so will result in loss of content.

  7. Run hadoop daemons

Start DFS daemons. On the master, execute from HADOOP_PREFIX:

$ sbin/start-dfs.sh
$ jps

Processes which should be running on the master after starting:
NameNode
SecondaryNameNode
Jps

Check on the slave whether the DFS daemons started:

$ jps

Processes running on the slave:
DataNode
Jps

Start YARN daemons:
From HADOOP_PREFIX execute

$ sbin/start-yarn.sh
$ jps

Processes running on the master:
NameNode
SecondaryNameNode
ResourceManager
Jps

Check on the slave whether the YARN daemons started:

$ jps

Processes running on the slave:
DataNode
Jps
NodeManager
  8. Run sample and validate

Let’s run the wordcount sample to validate the setup. Make an input file/directory.

$ mkdir input
$ cat > input/file
This is a sample file.
This is a sample line.

    Add this directory to HDFS:

    $ bin/hdfs dfs -copyFromLocal input /input

 

Run example:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output

To check the output execute the below command:

   $ bin/hdfs dfs -cat /output/*
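As a quick sanity check, the expected word counts can be reproduced locally with standard shell tools (a rough sketch only; the MapReduce job emits tab-separated word/count pairs, while uniq -c prints the count first):

```shell
# Count word occurrences in the sample input locally (sketch, not the MapReduce job).
printf 'This is a sample file.\nThis is a sample line.\n' \
  | tr ' ' '\n' | sort | uniq -c
```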

 

  9. Web interface

We can browse HDFS and check its health using http://masterHostname:50070 in the browser. We can also check the status of running applications via the ResourceManager web UI:

URL: http://masterHostname:8088

Done !!

Installing Phoenix – A step by step tutorial

Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.

Prerequisites –

1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the Java executable's directory, $JAVA_HOME/bin, is on the PATH.

2. Make sure you have installed HBase on your machine; for that, refer to my post HBase Installation in Pseudo-Distributed mode.

Install Phoenix –

1. Download phoenix-4.7.0-HBase-1.1 and extract the installation tar.

tar -zxvf phoenix-4.7.0-Hbase-1.1-bin.tar.gz

2. Add the phoenix-[version]-server.jar to the classpath of HBase region server and master and remove any previous version. An easy way to do this is to copy it into the HBASE_INSTALL_DIR/lib directory.

3. Add the phoenix-[version]-client.jar to the Phoenix client.

4. Restart Hbase.

5. Run an example to test that everything is working:

Open the command line. A terminal interface to execute SQL from the command line is bundled with Phoenix. To start it, execute the following from the bin directory:

$ sqlline.py localhost

If ZooKeeper is running externally (in our case, it runs on the master node), run instead:

   $ sqlline.py <master-hostname>:2181

a. First, let’s create a us_population.sql file, containing a table definition:

CREATE TABLE IF NOT EXISTS us_population (
 state CHAR(2) NOT NULL,
 city VARCHAR NOT NULL,
 population BIGINT
 CONSTRAINT my_pk PRIMARY KEY (state, city));

b. Now let’s create a us_population.csv file containing some data to put in that table:

 NY,New York,8143197
 CA,Los Angeles,3844829
 IL,Chicago,2842518
 TX,Houston,2016582
 PA,Philadelphia,1463281
 AZ,Phoenix,1461575
 TX,San Antonio,1256509
 CA,San Diego,1255540
 TX,Dallas,1213825
 CA,San Jose,912332

c. And finally, let’s create a us_population_queries.sql file containing a query we’d like to run on that data.

SELECT state as "State",count(city) as "City Count",sum(population) as "Population Sum"
 FROM us_population
 GROUP BY state
 ORDER BY sum(population) DESC;
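For reference, the same aggregation can be cross-checked outside Phoenix with awk over the CSV above (a sketch; field 1 is the state, field 3 the population):

```shell
# Group by state (field 1): count cities and sum population (field 3),
# then sort by the population sum, descending.
awk -F, '{count[$1]++; sum[$1]+=$3}
         END {for (s in count) print s, count[s], sum[s]}' <<'EOF' | sort -k3,3 -rn
NY,New York,8143197
CA,Los Angeles,3844829
IL,Chicago,2842518
TX,Houston,2016582
PA,Philadelphia,1463281
AZ,Phoenix,1461575
TX,San Antonio,1256509
CA,San Diego,1255540
TX,Dallas,1213825
CA,San Jose,912332
EOF
```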

d. Loading data: In addition, you can use bin/psql.py to load CSV data or execute SQL scripts. For example:

./psql.py <your_zookeeper_quorum> us_population.sql us_population.csv us_population_queries.sql

You have created a table in Phoenix, inserted data, and run a query.

 

This is it 🙂

Hbase Installation in Pseudo-Distributed mode

HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. This tutorial provides an introduction to HBase, the procedure to set up HBase on the Hadoop file system, and ways to interact with the HBase shell. It also describes how to connect to HBase using Java, and how to perform basic operations on HBase using Java.

Prerequisites :

1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the Java executable's directory, $JAVA_HOME/bin, is on the PATH.

2. SSH configured

Make sure that the machines in the Hadoop cluster are able to do password-less SSH. In the case of a single-node setup, the machine should be able to ssh to localhost.

 $ ssh-keygen -t rsa 
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
 $ chmod 0600 ~/.ssh/authorized_keys

3. Before we start configuring HBase, you need a running Hadoop, which will be the storage for HBase (HBase stores data in the Hadoop Distributed File System). Please refer to the Hadoop-Yarn Installation in Pseudo-distributed mode post before continuing.

Installing And Configuring Hbase

1. Download the latest stable version of HBase from http://www.interior-dsgn.com/apache/hbase/stable/ using the "wget" command, and extract it using the "tar -zxvf" command. See the following command.

$ wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-1.1.4-bin.tar.gz
$ tar -zxvf hbase-1.1.4-bin.tar.gz

 

2. Go to <HBASE_HOME>/conf/hbase-env.sh

Export the JAVA_HOME environment variable in the hbase-env.sh file as shown below:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67

Go to <HBASE_HOME>/conf/hbase-site.xml

 

Inside the hbase-site.xml file, you will find the <configuration> and </configuration> tags. Within them, set the HBase directory under the property key with the name "hbase.rootdir" as shown below.

<configuration>
   <!-- Here you have to set the path where you want HBase to store its files. -->
   <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
   </property>

   <!-- Here you have to set the path where you want HBase to store its built-in ZooKeeper files. -->
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/home/hadoop/zookeeper</value>
   </property>
   <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
   </property>
</configuration>

3. Starting HBase

After configuration is over, browse to HBase home folder and start HBase using the following command.

$bin/start-hbase.sh

4. Checking the HBase Directory in HDFS
HBase creates its directory in HDFS. To see the created directory, browse to Hadoop bin and type the following command.

 $ ./bin/hadoop fs -ls /hbase

If everything goes well, it will give you the following output.
Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs

5. Run a sample example

Go to <HBASE_HOME> and run:

$ bin/hbase shell

Create a table

Use the create command to create a new table. We must specify the table name and the ColumnFamily name:

hbase(main):001:0> create 'test', 'cf'
0 row(s) in 3.3340 seconds

=> Hbase::Table - test

Populating the data

Here, we insert three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in HBase are composed of a column family prefix, cf in this example, followed by a colon and then a column qualifier suffix, a in the case below:

hbase(main):008:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 1.3280 seconds

hbase(main):009:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0340 seconds

hbase(main):010:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0150 seconds

Scanning the table for all data at once

We can get data from HBase using scan. We can limit our scan, but for now, all data is fetched:

hbase(main):011:0> scan 'test'
ROW                               COLUMN+CELL                                                                                     
 row1                             column=cf:a, timestamp=1427820136323, value=value1                                              
 row2                             column=cf:b, timestamp=1427820144111, value=value2                                              
 row3                             column=cf:c, timestamp=1427820153067, value=value3                                              
3 row(s) in 0.1650 seconds

Get a single row of data –

To get a single row of data at a time, we can use the get command.

hbase(main):012:0> get 'test', 'row1'
COLUMN                            CELL                                                                                            
 cf:a                             timestamp=1427820136323, value=value1                                                           
1 row(s) in 0.0650 seconds

 

🙂

Hadoop-Yarn Installation in Pseudo-distributed mode

Hadoop is an open-source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Prerequisites :

 1. Java JDK (This demo uses JDK version 1.7.0_67)

Make sure the JAVA_HOME environment variable points to the JDK, and that the Java executable's directory, $JAVA_HOME/bin, is on the PATH.

 2. SSH configured

Make sure that the machines in the Hadoop cluster are able to do password-less SSH. In the case of a single-node setup, the machine should be able to ssh to localhost.

 $ ssh-keygen -t rsa 
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
 $ chmod 0600 ~/.ssh/authorized_keys 

Installation Steps:

3. Download hadoop-2.6.4.tar.gz from http://hadoop.apache.org/releases.html and extract it to some path on your machine. We assume that "impadmin" is the dedicated user for Hadoop.

4. Setup environment variables
Export the environment variables mentioned below.

 JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67 
 HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
 PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:. 
 HADOOP_COMMON_HOME=$HADOOP_PREFIX 
 HADOOP_HDFS_HOME=$HADOOP_PREFIX 
 YARN_HOME=$HADOOP_PREFIX 
 HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop 
 YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop 

For this demo we have modified "hadoop-env.sh" for exporting the variables.
You can also use ~/.bashrc, /etc/bash.bashrc, or another startup script to export
these variables.

5. Create HDFS directories
Create two directories to be used by namenode and datanode.

Go to <HADOOP_PREFIX>:

 mkdir -p hdfs/namenode
 mkdir -p hdfs/datanode

List the folders:

 ls -r hdfs

You will see:

 namenode datanode

6. Tweak config files
Go to the etc/hadoop folder under HADOOP_PREFIX and add the following
properties under the configuration tag in the files mentioned below:

etc/hadoop/yarn-site.xml:

 <property> 
 <name>yarn.nodemanager.aux-services</name> 
 <value>mapreduce_shuffle</value> 
 </property> 
 <property> 
 <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> 
 <value>org.apache.hadoop.mapred.ShuffleHandler</value> 
 </property> 

etc/hadoop/core-site.xml:

<property> 
 <name>fs.default.name</name> 
 <value>hdfs://localhost:9000</value> 
 </property>

etc/hadoop/hdfs-site.xml:

<property> 
 <name>dfs.replication</name> 
 <value>1</value> 
 </property> 
 <property> 
 <name>dfs.namenode.name.dir</name> 
 <value>file:<HADOOP_PREFIX>/hdfs/namenode</value> 
 </property> 
 <property> 
 <name>dfs.datanode.data.dir</name> 
 <value>file:<HADOOP_PREFIX>/hdfs/datanode</value> 
 </property>

etc/hadoop/mapred-site.xml:
If this file does not exist, create it and paste the content provided below:

<?xml version="1.0"?>
 <configuration> 
 <property> 
 <name>mapreduce.framework.name</name> 
 <value>yarn</value> 
 </property> 
 </configuration>

7. Format namenode
This is a one-time activity.

$ bin/hadoop namenode -format 
 or 
 $ bin/hdfs namenode -format 

Once you have data on HDFS, DO NOT run this command again; doing so will
result in loss of content.

8. Run hadoop daemons

Start DFS daemons:
From <HADOOP_PREFIX> execute

 $ sbin/start-dfs.sh
 $ jps

You will see the following processes running at this point:

 18831 SecondaryNameNode
 18983 Jps
 18343 NameNode
 18563 DataNode

Start YARN daemons:
From HADOOP_PREFIX execute

 $ sbin/start-yarn.sh
 $ jps

You will see the following processes at this point:

 18831 SecondaryNameNode
 18983 Jps
 18343 NameNode
 18563 DataNode
 19312 NodeManager
 19091 ResourceManager

Note: you can also use start-all.sh and stop-all.sh for starting/stopping the daemons.

Start Job History Server:
From HADOOP_PREFIX execute

sbin/mr-jobhistory-daemon.sh start historyserver

9. Run sample and validate
Let’s run the wordcount sample to validate the setup.
Make an input file/directory.

$ mkdir input 
 $ cat > input/file 
 This is a sample file. 
 This is a sample line. 

Add this directory to HDFS:

$bin/hdfs dfs -copyFromLocal input /input 

Run example:

$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output

To check the output execute the below command:

$ bin/hdfs dfs -cat /output/*

10. Web interface
We can browse HDFS and check health using http://localhost:50070 in the
browser.

 

Installation Completed 🙂

Installing Apache Spark in local mode on Windows 8

In this post I will walk through the process of downloading and running Apache Spark on Windows 8 x64 in local mode on a single computer.

Prerequisites

  1. Java Development Kit (JDK 7 or 8) (I installed it at 'C:\Program Files\Java\jdk1.7.0_67').
  2. Scala 2.11.7 (I installed it at 'C:\Program Files (x86)\scala'. This is optional).
  3. After installation, we need to set the following environment variables:
    1. JAVA_HOME, the value is the JDK path.
      In my case it will be 'C:\Program Files\Java\jdk1.7.0_67'. For more details click here.
      Then append '%JAVA_HOME%\bin' to the PATH environment variable.
    2. SCALA_HOME, the value is the Scala path.
      In my case it will be 'C:\Program Files (x86)\scala'.
      Then append '%SCALA_HOME%\bin' to the PATH environment variable.

Downloading and installing Spark

  1. It is easy to follow the instructions on http://spark.apache.org/docs/latest/ and download Spark 1.6.0 (Jan 04 2016) with the "Pre-built for Hadoop 2.6 and later" package type from http://spark.apache.org/downloads.html


2. Extract the zipped file to D:\Spark.

3. Spark has two shells; they live in the D:\Spark\bin\ directory:

       a. Scala shell (D:\Spark\bin\spark-shell.cmd).
       b. Python shell (D:\Spark\bin\pyspark.cmd).

4. You can run either one of them, and you will see the following exception:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

This issue is caused by a missing winutils.exe, which Spark needs in order to initialize the Hive context, which in turn depends on Hadoop, which requires native libraries on Windows to work properly. Unfortunately, this happens even if you are using Spark in local mode without utilizing any of the HDFS features directly.


To resolve this problem, you need to:

a. download the 64-bit winutils.exe (106KB)

b. copy the downloaded file winutils.exe into a folder like D:\hadoop\bin (or D:\spark\hadoop\bin)

c. set the environment variable HADOOP_HOME to point to the above directory but without \bin. For example:

  • if you copied the winutils.exe to D:\hadoop\bin, set HADOOP_HOME=D:\hadoop
  • if you copied the winutils.exe to D:\spark\hadoop\bin, set HADOOP_HOME=D:\spark\hadoop

d. Double-check that the environment variable HADOOP_HOME is set properly by opening the Command Prompt and running echo %HADOOP_HOME%

e. You will also notice that when starting spark-shell.cmd, Hive will create a C:\tmp\hive folder. If you receive any errors related to permissions of this folder, use the following commands to set the permissions on that folder:

  • List current permissions: %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
  • Set permissions: %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
  • List updated permissions: %HADOOP_HOME%\bin\winutils.exe ls \tmp\hive

5. Re-run spark-shell; it should work as expected.

Text search sample

[screenshot: text search program run in the Spark shell]

Hope that helps!

 

Bulk Insert Java API of Elasticsearch

The bulk API allows one to index and delete several documents in a single request. Here is a sample usage

accounts.json (JSON file which needs to be inserted into Elasticsearch)

accounts (download the file from the given link)

Sample data present in file:

{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
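As the sample shows, the bulk file alternates an action/metadata line with a document line. That shape can be sanity-checked locally (a sketch using a trimmed-down document; the file path is illustrative):

```shell
# Write a two-line sample in the bulk format: an action line, then a document line.
cat > /tmp/accounts_sample.json <<'EOF'
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber"}
EOF
# Action lines start with {"index" and carry the document _id.
grep -c '^{"index"' /tmp/accounts_sample.json
```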

Java File

package com.elasticsearch.index;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

public class IndexData {

    public static void main(String[] args) throws ParseException, IOException {

        // configuration settings
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "test-cluster").build();
        TransportClient client = TransportClient.builder().settings(settings).build();

        String hostname = "<Your-Hostname>";
        int port = 9300;
        client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(hostname), port));

        // bulk API
        BulkRequestBuilder bulkBuilder = client.prepareBulk();

        long bulkBuilderLength = 0;
        String readLine = "";
        String index = "testindex";
        String type = "testtype";
        String _id = null;

        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("accounts.json")));
        JSONParser parser = new JSONParser();

        while ((readLine = br.readLine()) != null) {
            // skip the metadata lines used by the bulk insert format
            if (readLine.startsWith("{\"index")) {
                continue;
            } else {
                Object json = parser.parse(readLine);
                if (((JSONObject) json).get("account_number") != null) {
                    _id = String.valueOf(((JSONObject) json).get("account_number"));
                    System.out.println(_id);
                }

                // _id is the unique field in elasticsearch
                JSONObject jsonObject = (JSONObject) json;
                bulkBuilder.add(client.prepareIndex(index, type, String.valueOf(_id)).setSource(jsonObject));
                bulkBuilderLength++;

                try {
                    // flush the bulk request every 100 documents
                    if (bulkBuilderLength % 100 == 0) {
                        System.out.println("##### " + bulkBuilderLength + " data indexed.");
                        BulkResponse bulkRes = bulkBuilder.execute().actionGet();
                        if (bulkRes.hasFailures()) {
                            System.out.println("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
                        }
                        bulkBuilder = client.prepareBulk();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        br.close();

        // flush any remaining documents
        if (bulkBuilder.numberOfActions() > 0) {
            System.out.println("##### " + bulkBuilderLength + " data indexed.");
            BulkResponse bulkRes = bulkBuilder.execute().actionGet();
            if (bulkRes.hasFailures()) {
                System.out.println("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
            }
            bulkBuilder = client.prepareBulk();
        }
    }
}

Maven dependencies:

<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.googlecode.json-simple</groupId>
<artifactId>json-simple</artifactId>
<version>1.1</version>
</dependency>

Hit the Elasticsearch count endpoint to check the number of records inserted:

curl -XGET 'http://<ip-address>:9200/testindex/testtype/_count'

Response:
{
  "count": 1000,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  }
}