This blog is the first of two which will cover using the Cloudera distribution: #1 Pre-Cloudera Setup/Architecture and #2 Cloudera Installation.
The target audience is companies with data on the cusp of the “big data” label (think the three V’s). We will place a larger emphasis on the pre-cluster setup, from a freshly installed OS to being ready to install Cloudera via Cloudera Manager. This step is often overlooked, and skipping it tends to cause trouble from the get-go.
In the past, I have worked with a few companies that were considering making the leap into the big data world. This can be quite a daunting task, especially with the ever-changing Hadoop ecosystem.
In almost all these cases, it was beneficial to demo the quick start tools that some of the major Hadoop distributions provide, namely Cloudera and Hortonworks. However, because the quick start tools run Hadoop in pseudo-distributed mode, they generally do not provide enough flexibility to scale out and prove more complex data scenarios. From an engineering standpoint, these demonstrations often provide little value in proving the feasibility of making the leap to Hadoop, although they can be used for individual learning.
The following illustration shows the architecture we will be using for this POC setup. Basically, it will be a 4-node cluster with one external connection, which will be discussed in an additional blog after the Post-Cloudera Installation Configuration. It is always preferable to use actual dedicated Linux machines (CentOS 6.5+ works well), but oftentimes when doing a POC these may not be handy. So for this 4-node cluster, we will use one beefy Windows Server box and split it into 4 virtual machines using Oracle VirtualBox.
The machines can follow any naming convention, but for this blog we will name them with the hdp prefix (hdp for Hadoop), followed by the purpose of the machine and a number. MST denotes the Master or NameNode, SNN would denote a Secondary NameNode (not used in this setup), and SLV# denotes a DataNode.
hdpmst1: 64GB RAM, 4 Cores, 1x500GB Hard Drive
hdpslv1-3: 12GB RAM, 2 Cores, 1x1TB Hard Drive
The figure below labels all the services each machine will have installed:
Prior to installing Cloudera, there are a few configuration tasks that need to be done on each of the virtual machines. Instead of performing the following tasks on each machine individually, it is recommended to perform them on one machine, export that machine as an appliance (shown in the appendix), then re-import it and rename it. This section will assume we are setting up one machine with the intent to export/import and repurpose it.
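If you prefer the command line to the VirtualBox GUI, the export/import round trip can also be driven with VBoxManage. The following is only a sketch: the base VM name hdpbase and the file hdpbase.ova are hypothetical placeholders, and the helper deliberately just prints the commands so they can be reviewed before anything is run.

```shell
# Print the VBoxManage commands that export a prepared base VM once and
# re-import it under each cluster node name.
# Assumptions (not from this blog): the base VM name and the .ova path
# are whatever you pass in; "hdpbase" below is only an example.
print_clone_commands() {
    base_vm="$1"      # name of the prepared VM in VirtualBox
    ova_file="$2"     # appliance file to create
    shift 2
    echo "VBoxManage export ${base_vm} -o ${ova_file}"
    for node in "$@"; do
        echo "VBoxManage import ${ova_file} --vsys 0 --vmname ${node}"
    done
}

# Example:
#   print_clone_commands hdpbase hdpbase.ova hdpmst1 hdpslv1 hdpslv2 hdpslv3
```

Each imported machine still needs its own hostname and IP address, as covered in the configuration steps that follow.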
Downloads & Installs
The following items should be installed if they are not already. Terminal commands will be in green and changes to configuration files will be illustrated in orange.
(This will be used for mounting drives or shares)
[root@hdpmst1 ~]# yum -y install cifs-utils
Install and configure NTP
[root@hdpmst1 ~]# yum -y install ntp
[root@hdpmst1 ~]# vi /etc/ntp.conf
# line 18: add the network range you allow to receive requests
restrict 10.0.0.0 mask 255.255.255.0 nomodify notrap
# change servers for synchronization
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
server ntp1.jst.mfeed.ad.jp iburst
server ntp2.jst.mfeed.ad.jp iburst
server ntp3.jst.mfeed.ad.jp iburst
[root@hdpmst1 ~]# chkconfig ntpd on
[root@hdpmst1 ~]# service ntpd start
(On CentOS 7, which uses systemd, the equivalents are systemctl enable ntpd and systemctl start ntpd.)
Sync the node with the NTP server
[root@hdpmst1 ~]# ntpdate -u <NTP Server Here>
Sync the hardware clock from the system clock
[root@hdpmst1 ~]# hwclock --systohc
Verify that ntpd is running
[root@hdpmst1 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp1.jst.mfeed. 172.X.X.X        2 u   29   64    1   18.826   -0.126   0.000
+ntp2.jst.mfeed. 172.X.X.X        2 u   28   64    1   21.592    0.018   0.000
*ntp3.jst.mfeed. 133.X.X.X        2 u   28   64    1   22.666   -1.033   0.000
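In the ntpq output, the peer the daemon is currently synchronized to is the line that starts with an asterisk. If you want to check this on every node without eyeballing the table, something along these lines may help. The function names are made up for illustration; both read ntpq-style text from standard input.

```shell
# Succeed if ntpq-style peer output (read from stdin) contains a selected
# sync peer, i.e. a line whose first character is "*".
has_sync_peer() {
    grep -q '^\*'
}

# Print the name of the selected peer (without the leading "*"), if any.
selected_peer() {
    awk '/^\*/ { print substr($1, 2); exit }'
}

# Example on a live node:
#   ntpq -p | has_sync_peer && echo "clock is synchronized"
```

Right after ntpd starts it can take a few polling intervals before any peer is selected, so a failed check on a freshly started node is not necessarily a problem.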
The following configurations should be done prior to starting the Cloudera install.
Set up the cluster hosts
[root@hdpmst1 ~]# vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
# Internal cluster connections
# Master Nodes
10.20.5.1   hdpmst1.local hdpmst1
# Slave Nodes
10.20.5.2   hdpslv1.local hdpslv1
10.20.5.3   hdpslv2.local hdpslv2
10.20.5.4   hdpslv3.local hdpslv3
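Since the cluster entries follow a fixed pattern (master on .1, slaves numbered upward from .2, fully qualified name first and short name second), they can be generated rather than typed. This is a small sketch with a hypothetical helper name; it only prints the lines and leaves it to you to paste or append them.

```shell
# Print /etc/hosts entries for one master and N slave nodes, following the
# naming scheme used in this blog (hdpmst1, hdpslv1..N, ".local" domain).
#   $1 = first three octets of the internal network, e.g. 10.20.5
#   $2 = number of slave nodes
gen_cluster_hosts() {
    subnet="$1"
    slaves="$2"
    echo "${subnet}.1 hdpmst1.local hdpmst1"
    i=1
    while [ "$i" -le "$slaves" ]; do
        # Slave i gets the address one above its own number (.2, .3, ...).
        echo "${subnet}.$((i + 1)) hdpslv${i}.local hdpslv${i}"
        i=$((i + 1))
    done
}

# Example:
#   gen_cluster_hosts 10.20.5 3
```

Generating the block once and reusing it on every node helps keep the hosts files identical across the cluster.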
Enter your nameserver(s) here
[root@hdpmst1 ~]# vi /etc/resolv.conf
nameserver 10.0.0.1
nameserver 10.X.X.X
Disable SELinux
[root@hdpmst1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
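The SELinux config file is only read at boot, so it is worth confirming what the file actually says before you reboot or clone the machine. A minimal sketch, using a hypothetical helper that takes the file path as an argument:

```shell
# Print the SELINUX= value configured in an SELinux config file.
#   $1 = path to the file (defaults to /etc/sysconfig/selinux)
# Comment lines and the separate SELINUXTYPE= setting are ignored because
# the pattern requires the line to start with exactly "SELINUX=".
selinux_config_mode() {
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "${1:-/etc/sysconfig/selinux}" | tail -n 1
}

# Example:
#   selinux_config_mode        # prints "disabled" once the edit is saved
```

On a running system, getenforce reports the mode currently in effect, and setenforce 0 switches to permissive until the next reboot.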
HOSTNAME should be changed on each machine once the initial setup machine is cloned. This means, for example, that hdpslv1 would have HOSTNAME=hdpslv1.local.
[root@hdpmst1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hdpmst1.local
We need to configure both eth0 and eth1. Those configurations are shown here for the initial machine and master node, hdpmst1. Again, it is important to note that the IPADDR (for eth1) and HOSTNAME need to change once the machine is cloned and renamed.
[root@hdpmst1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=dhcp
HWADDR=08:00:27:98:73:C7
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME=eth0
HOSTNAME=hdpmst1.local
[root@hdpmst1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME=eth1
PEERDNS=yes
PEERROUTES=yes
IPADDR=10.20.5.1
NETMASK=255.255.255.0
HOSTNAME=hdpmst1.local
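Because every clone needs a different IPADDR and HOSTNAME, it can be less error-prone to script that change than to edit each file by hand. Here is one possible sketch; set_clone_identity is a hypothetical helper, and it only touches lines that already begin with IPADDR= or HOSTNAME= in the file you point it at.

```shell
# Rewrite IPADDR and HOSTNAME in an ifcfg-style file for a cloned node.
#   $1 = file to edit
#   $2 = new IP address
#   $3 = new fully qualified hostname
# Lines without those keys are left untouched. The result is written back
# with "cat >" so the original file's ownership and permissions are kept.
set_clone_identity() {
    cfg="$1"; ip="$2"; fqdn="$3"
    tmp_cfg="$(mktemp)" || return 1
    sed -e "s/^IPADDR=.*/IPADDR=${ip}/" \
        -e "s/^HOSTNAME=.*/HOSTNAME=${fqdn}/" "$cfg" > "$tmp_cfg" &&
        cat "$tmp_cfg" > "$cfg"
    status=$?
    rm -f "$tmp_cfg"
    return "$status"
}

# Example on the clone that will become hdpslv1:
#   set_clone_identity /etc/sysconfig/network-scripts/ifcfg-eth1 10.20.5.2 hdpslv1.local
#   set_clone_identity /etc/sysconfig/network 10.20.5.2 hdpslv1.local
```

Running the same helper against /etc/sysconfig/network updates only the HOSTNAME line there, since that file has no IPADDR entry.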
Finally, we need to make sure the machine gets the IP assigned as well as the hostname.
[root@hdpmst1 ~]# ifconfig eth1 10.20.5.1 netmask 255.255.255.0
[root@hdpmst1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:98:73:C7
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe98:73c7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:969718 errors:7 dropped:0 overruns:0 frame:0
          TX packets:893778 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:506203964 (482.7 MiB)  TX bytes:387911327 (369.9 MiB)
          Interrupt:19 Base address:0xd020
eth1      Link encap:Ethernet  HWaddr 08:00:27:C9:33:E1
          inet addr:10.20.5.1  Bcast:10.20.5.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fec9:33e1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16033954 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7389356 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:11358234809 (10.5 GiB)  TX bytes:2356875364 (2.1 GiB)
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:52942701 errors:0 dropped:0 overruns:0 frame:0
          TX packets:52942701 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:106396678158 (99.0 GiB)  TX bytes:106396678158 (99.0 GiB)
[root@hdpmst1 ~]# hostname hdpmst1.local
[root@hdpmst1 ~]# hostname -f
hdpmst1.local
Turn off iptables and reboot the machine
[root@hdpmst1 ~]# chkconfig iptables off
Disable Transparent Huge Pages (THP) by adding the following lines to the rc.local file. THP is detrimental to the performance of tools such as Hue.
[root@hdpmst1 ~]# vi /etc/rc.local
...other entries here…
# Disable THP for Cloudera
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
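After a reboot you can confirm the rc.local entries took effect by reading the same files back. The kernel lists every option and wraps the active one in square brackets, for example "always madvise [never]". The helper below is a hypothetical sketch that extracts the bracketed value; the path is a parameter because on some RHEL/CentOS 6 kernels the directory may be named redhat_transparent_hugepage instead.

```shell
# Print the active setting from a THP sysfs file, where the active value
# is the one wrapped in square brackets, e.g. "always madvise [never]".
#   $1 = file to read (defaults to the "enabled" file used in this blog)
thp_active_setting() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example:
#   thp_active_setting
#   thp_active_setting /sys/kernel/mm/transparent_hugepage/defrag
```

Both calls should print never once the settings are applied.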
Set vm.swappiness = 1 in the sysctl.conf file. We don’t want swapping during cluster operations, and Cloudera Manager will throw warnings otherwise.
[root@hdpmst1 ~]# vi /etc/sysctl.conf
...other entries here…
# Control Swapping
vm.swappiness=1
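One easy mistake here is the key name: sysctl keys are case-sensitive, so the entry has to be spelled vm.swappiness in all lowercase. The following sketch (configured_swappiness is a hypothetical helper) reads the value back out of a sysctl.conf-style file so a typo shows up before Cloudera Manager complains.

```shell
# Print the vm.swappiness value configured in a sysctl.conf-style file.
#   $1 = file to read (defaults to /etc/sysctl.conf)
# Only the exact lowercase key is matched. If the key appears more than
# once, the last occurrence is printed, since later entries are applied
# after earlier ones.
configured_swappiness() {
    sed -n 's/^[[:space:]]*vm\.swappiness[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        "${1:-/etc/sysctl.conf}" | tail -n 1
}

# Example:
#   configured_swappiness      # prints 1 once the entry is in place
```

To apply the file without rebooting, sysctl -p reloads /etc/sysctl.conf, and sysctl -n vm.swappiness prints the value the kernel is currently using.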
When you have configured all four machines, your setup should look something like the following within Oracle VirtualBox (minus them saying Aborted – hah)
Now that you have 4 machines set up, talking to each other, and pre-configured, you are ready to install Cloudera using the installation bin they provide. These pre-configuration steps are important when setting up any cluster because they save you time compared with digging through the cryptic Cloudera Installer logs when an install fails. See the next blog in the series (#2) for a detailed install explanation.