Hadoop

Install Hadoop Multinode Cluster using CDH4 in RHEL/CentOS 6.5


Hadoop is an open-source programming framework developed by Apache to process big data. It uses HDFS (Hadoop Distributed File System) to store data across all the DataNodes in the cluster in a distributed manner, and the MapReduce model to process that data.


Install Hadoop Multinode Cluster

The NameNode (NN) is the master daemon that controls HDFS, and the JobTracker (JT) is the master daemon for the MapReduce engine.

Requirements

In this tutorial I’m using two CentOS 6.3 VMs, ‘master‘ and ‘node‘ (master and node are my hostnames). The ‘master’ IP is 172.21.17.175 and the node IP is ‘172.21.17.188‘. The following instructions also work on other RHEL/CentOS 6.x versions.

On Master
[root@master ~]# hostname

master
[root@master ~]# ifconfig|grep 'inet addr'|head -1

inet addr:172.21.17.175  Bcast:172.21.19.255  Mask:255.255.252.0
On Node
[root@node ~]# hostname

node
[root@node ~]# ifconfig|grep 'inet addr'|head -1

inet addr:172.21.17.188  Bcast:172.21.19.255  Mask:255.255.252.0

First, if you do not have DNS set up, make sure that all the cluster hosts are present in the ‘/etc/hosts‘ file on each node.

On Master
[root@master ~]# cat /etc/hosts

172.21.17.175 master
172.21.17.188 node
On Node
[root@node ~]# cat /etc/hosts

172.21.17.175 master
172.21.17.188 node

Installing Hadoop Multinode Cluster in CentOS

We use the official Cloudera repository to install CDH4 on all the hosts (Master and Node) in the cluster.

Step 1: Download and Install the CDH Repository

Go to the official CDH download page and grab the CDH4 (i.e. 4.6) release, or use the following wget commands to download the repository package and install it.

On RHEL/CentOS 32-bit
# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/i386/cloudera-cdh-4-0.i386.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm
On RHEL/CentOS 64-bit
# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

Before installing the Hadoop multinode cluster, add the Cloudera public GPG key to your repository by running one of the following commands, according to your system architecture.

-- on 32-bit System --
# rpm --import http://archive.cloudera.com/cdh4/redhat/6/i386/cdh/RPM-GPG-KEY-cloudera
-- on 64-bit System --
# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
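Before moving on, you can optionally confirm that the repository registered correctly; the package name cloudera-cdh below is an assumption based on what the one-click-install RPM typically registers, so adjust it if your system reports a different name.

# rpm -q cloudera-cdh
# yum repolist | grep -i cloudera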

Step 2: Setup JobTracker & NameNode

Next, run the following commands to install and set up the JobTracker and NameNode on the Master server.

[root@master ~]# yum clean all
[root@master ~]# yum install hadoop-0.20-mapreduce-jobtracker
[root@master ~]# yum clean all
[root@master ~]# yum install hadoop-hdfs-namenode

Step 3: Setup Secondary NameNode

Again, run the following commands on the Master server to set up the Secondary NameNode.

[root@master ~]# yum clean all
[root@master ~]# yum install hadoop-hdfs-secondarynamenode

Step 4: Setup TaskTracker & DataNode

Next, set up the TaskTracker & DataNode on all cluster hosts (Node) except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts (on node in this case).

[root@node ~]# yum clean all
[root@node ~]# yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

Step 5: Setup Hadoop Client

You can install the Hadoop client on a separate machine (in this case I have installed it on the DataNode, but you can install it on any machine).

[root@node ~]# yum install hadoop-client

Step 6: Deploy HDFS on Nodes

Now that we are done with the above steps, let’s move forward to deploy HDFS (to be done on all the nodes).

Copy the default configuration to the /etc/hadoop directory (on each node in the cluster).

[root@master ~]# cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster
[root@node ~]# cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster

Use the alternatives command to set your custom directory, as follows (on each node in the cluster).

[root@master ~]# alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
reading /var/lib/alternatives/hadoop-conf

[root@master ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
[root@node ~]# alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
reading /var/lib/alternatives/hadoop-conf

[root@node ~]# alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
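To confirm the switch took effect, you can optionally display the alternative on each node; the link should now point to /etc/hadoop/conf.my_cluster.

[root@master ~]# alternatives --display hadoop-conf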

Step 7: Customizing Configuration Files

Now open the ‘core-site.xml‘ file and update “fs.defaultFS” on each node in the cluster.

[root@master conf]# cat /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://master/</value>
</property>
</configuration>
[root@node conf]# cat /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://master/</value>
</property>
</configuration>

Next, update “dfs.permissions.superusergroup” in hdfs-site.xml on each node in the cluster.

[root@master conf]# cat /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
     <name>dfs.permissions.superusergroup</name>
     <value>hadoop</value>
  </property>
</configuration>
[root@node conf]# cat /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
     <name>dfs.permissions.superusergroup</name>
     <value>hadoop</value>
  </property>
</configuration>

Note: Please make sure that the above configuration is present on all the nodes (make the change on one node, then use scp to copy it to the rest of the nodes, as shown below).
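For example, a minimal way to push the two edited files from master to node (assuming the same conf.my_cluster layout is active on both machines):

[root@master ~]# scp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml node:/etc/hadoop/conf/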

Step 8: Configuring Local Storage Directories

Update “dfs.namenode.name.dir” in ‘hdfs-site.xml’ on the NameNode (Master), and “dfs.datanode.data.dir” on the DataNode (Node). Please change the values as shown.

[root@master conf]# cat /etc/hadoop/conf/hdfs-site.xml
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:///data/1/dfs/nn,/nfsmount/dfs/nn</value>
</property>
[root@node conf]# cat /etc/hadoop/conf/hdfs-site.xml
<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:///data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
</property>

Step 9: Create Directories & Manage Permissions

Execute the commands below to create the directory structure & manage user permissions on the NameNode (Master) and DataNode (Node) machines.

[root@master]# mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
[root@master]# chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
[root@node]# mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
[root@node]# chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

Format the NameNode (on Master) by issuing the following command.

[root@master conf]# sudo -u hdfs hdfs namenode -format

Step 10: Configuring the Secondary NameNode

Add the following property to the hdfs-site.xml file on the Master, replacing the value as shown.

<property>
  <name>dfs.namenode.http-address</name>
  <value>172.21.17.175:50070</value>
  <description>
    The address and port on which the NameNode UI will listen.
  </description>
</property>

Note: In our case, the value should be the IP address of the master VM.

Now let’s deploy MRv1 (MapReduce version 1). Open the ‘mapred-site.xml‘ file and set the following values as shown.

[root@master conf]# cp hdfs-site.xml mapred-site.xml
[root@master conf]# vi mapred-site.xml
[root@master conf]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
 <name>mapred.job.tracker</name>
 <value>master:8021</value>
</property>
</configuration>

Next, copy the ‘mapred-site.xml‘ file to the node machine using the following scp command.

[root@master conf]# scp /etc/hadoop/conf/mapred-site.xml node:/etc/hadoop/conf/
mapred-site.xml                                                                      100%  200     0.2KB/s   00:00

Now configure the local storage directories for use by the MRv1 daemons. Again open the ‘mapred-site.xml‘ file and make the changes shown below for each TaskTracker.

<property>
 <name>mapred.local.dir</name>
 <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>

After specifying these directories in the ‘mapred-site.xml‘ file, you must create the directories and assign the correct file permissions to them on each node in your cluster.

# mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
# chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local

Step 11: Start HDFS

Now run the following command to start HDFS on every node in the cluster.

[root@master conf]# for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
[root@node conf]# for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
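As an optional sanity check (not part of the original steps), you can run a dfsadmin report from the master; once the daemons are up, it should list one live DataNode (172.21.17.188).

[root@master conf]# sudo -u hdfs hdfs dfsadmin -report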

Step 12: Create HDFS /tmp and MapReduce /var Directories

It is required to create ‘/tmp‘ with the proper permissions, exactly as shown below.

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /tmp
[root@master conf]# sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
[root@master conf]# sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
[root@master conf]# sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
[root@master conf]# sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Now verify the HDFS file structure.

[root@master conf]# sudo -u hdfs hadoop fs -ls -R /

drwxrwxrwt   - hdfs   hadoop          0 2014-05-29 09:58 /tmp
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var/lib
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred hadoop          0 2014-05-29 09:59 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

After you start HDFS and create ‘/tmp‘, but before you start the JobTracker, create the HDFS directory specified by the ‘mapred.system.dir’ parameter (by default ${hadoop.tmp.dir}/mapred/system) and change its owner to mapred.

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
[root@master conf]# sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

Step 13: Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services.

On each TaskTracker system
[root@node conf]# service hadoop-0.20-mapreduce-tasktracker start

Starting Tasktracker:                               [  OK  ]
starting tasktracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-node.out
On the JobTracker system
[root@master conf]# service hadoop-0.20-mapreduce-jobtracker start

Starting Jobtracker:                                [  OK  ]

starting jobtracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-jobtracker-master.out

Next, create a home directory for each Hadoop user. It is recommended that you do this on the NameNode; for example:

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /user/<user>
[root@master conf]# sudo -u hdfs hadoop fs -chown <user> /user/<user>

Note: where <user> is the Linux username of each user.

Alternatively, you can create the home directory as follows.

[root@master conf]# sudo -u hdfs hadoop fs -mkdir /user/$USER
[root@master conf]# sudo -u hdfs hadoop fs -chown $USER /user/$USER
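With the daemons running and a home directory in place, you can optionally smoke-test MRv1 with the bundled pi example; the jar path below is an assumption based on the usual CDH4 MRv1 layout, so adjust it if your installation differs.

[root@master conf]# sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100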

Step 14: Open JT, NN UI from Browser

Open your browser and type the URL http://ip_address_of_namenode:50070 to access the NameNode.


Hadoop NameNode Interface

Open another tab in your browser and type the URL http://ip_address_of_jobtracker:50030 to access the JobTracker.


Hadoop Map/Reduce Administration
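If you prefer the command line, an optional reachability check with curl should print HTTP status 200 from both UIs (using my master IP here; substitute your own).

# curl -s -o /dev/null -w "%{http_code}\n" http://172.21.17.175:50070/
# curl -s -o /dev/null -w "%{http_code}\n" http://172.21.17.175:50030/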

This procedure has been successfully tested on RHEL/CentOS 5.x/6.x. Please comment below if you face any issues with the installation, and I will help you out with solutions.

Install and Configure Apache Oozie Workflow Scheduler for CDH 4.X on RHEL/CentOS 6/5


Oozie is an open-source scheduler for Hadoop; it simplifies workflow and coordination between jobs. We can define dependencies between jobs for an input data set, and hence automate job dependencies using the Oozie scheduler.


Install Oozie in Centos and RHEL

In this tutorial, I have installed Oozie on my master node (i.e. master is the hostname, where the NameNode/JT are installed); however, in a production system Oozie should be installed on a separate Hadoop node.

The installation instructions are divided into two parts, which we call A and B.

  A. Oozie Installation.
  B. Oozie Configuration.

Let’s first verify the system hostname, using the following ‘hostname‘ command.

[root@master]# hostname

master

Method A: Oozie Installation on RHEL/CentOS 6/5

We use the official CDH repository from Cloudera’s site to install CDH4. Go to the official CDH download section and download the CDH4 (i.e. 4.6) version, or use the following wget commands to download the repository and install it.

On RHEL/CentOS 6
# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/i386/cloudera-cdh-4-0.i386.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm

# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
On RHEL/CentOS 5
# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/5/i386/cloudera-cdh-4-0.i386.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm

# wget http://archive.cloudera.com/cdh4/one-click-install/redhat/5/x86_64/cloudera-cdh-4-0.x86_64.rpm
# yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

Once you’ve added the CDH repository to your system, you can use the following command to install Oozie.

[root@master ~]# yum install oozie

Now, install the Oozie client (the above command should cover the client installation; if it does not, try the command below).

[root@master ~]# yum install oozie-client

Note: The above installation also configures the Oozie service to run at system startup. Good job! We are done with the first part of the installation; now let’s move to the second part to configure Oozie.

Method B: Oozie Configuration on RHEL/CentOS 6/5

As Oozie does not directly interact with Hadoop, we do not need any mapred configuration here.

Caution: Please configure all the settings while Oozie is not running; that is, follow the steps below while the Oozie service is stopped.

Oozie ships with ‘Derby‘ as its default built-in DB; however, I recommend that you use a MySQL DB instead. So, let’s install the MySQL database using the following article.

  1. Install MySQL Database in RHEL/CentOS 6/5

Once you are done with the installation, move on to create the oozie DB and grant privileges as shown below.

[root@master ~]# mysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.5.38 MySQL Community Server (GPL) by Remi

Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database oozie;
Query OK, 1 row affected (0.00 sec)

mysql> grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye
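Before wiring Oozie to the database, you can optionally confirm that the grants work by connecting as the new user and running a trivial query:

[root@master ~]# mysql -uoozie -poozie oozie -e "SELECT 1;"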

Next, configure the Oozie properties for MySQL. Open the ‘oozie-site.xml‘ file and edit the following properties as shown.

[root@master ~]# cd /etc/oozie/conf
[root@master conf]# vi oozie-site.xml

Enter the following properties (just replace master [my hostname] with your hostname).

    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://master:3306/oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>

Download and add the MySQL JDBC connectivity driver JAR to the Oozie lib directory. To do so, run the following series of commands on the terminal.

[root@master oozie]# cd /tmp/
[root@master tmp]# wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.31.tar.gz
[root@master tmp]# tar -zxf mysql-connector-java-5.1.31.tar.gz
[root@master tmp]# cd mysql-connector-java-5.1.31
[root@master mysql-connector-java-5.1.31]# cp mysql-connector-java-5.1.31-bin.jar /var/lib/oozie/

Create the Oozie database schema by executing the command below; please note that it should be run as the oozie user.

[root@master ~]# sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run
Sample Output
setting OOZIE_CONFIG=/etc/oozie/conf
setting OOZIE_DATA=/var/lib/oozie
setting OOZIE_LOG=/var/log/oozie
setting OOZIE_CATALINA_HOME=/usr/lib/bigtop-tomcat
setting CATALINA_TMPDIR=/var/lib/oozie
setting CATALINA_PID=/var/run/oozie/oozie.pid
setting CATALINA_BASE=/usr/lib/oozie/oozie-server-0.20
setting CATALINA_OPTS=-Xmx1024m
setting OOZIE_HTTPS_PORT=11443
...
DONE
Oozie DB has been created for Oozie version '3.3.2-cdh4.7.0'
The SQL commands have been written to: /tmp/ooziedb-8250405588513665350.sql

You need to download the ExtJS library from the internet to enable the Oozie web console. Go to the official CDH ExtJS page and download the ExtJS version 2.2 libraries, or download the package using the following commands.

[root@master ~]# cd /tmp/
[root@master tmp]# wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
[root@master tmp]# unzip ext-2.2.zip
[root@master tmp]# mv ext-2.2 /var/lib/oozie/

Finally, start the Oozie server by running the following commands.

[root@master tmp]# service oozie status
not running.

[root@master tmp]# service oozie start

[root@master tmp]# service oozie status
running

[root@master tmp]# oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL

Open the Oozie UI using your favorite browser, pointing to your IP address. In this case, my IP is 192.168.1.129.

http://192.168.1.129:11000

Oozie Dashboard

If you see this UI, congratulations! You have successfully configured Oozie.

This procedure has been successfully tested on RHEL/CentOS 6/5. In my upcoming articles, I’m going to share how to configure and schedule Hadoop jobs via Oozie. Stay connected for more, and don’t forget to leave your feedback in the comments.