Hortonworks and Ubuntu deployment guide

I recently deployed Hortonworks in a small lab environment: four virtual machines on an Intel NUC barebone with 16 GB of memory. I saved the command-line history and made several screenshots, so if you want to deploy Hortonworks yourself, this might be useful.

Since my VMs run on VMware ESXi, I had to install open-vm-tools first, see below.

sudo apt-get install open-vm-tools

The installation requires the root account or an account with sufficient privileges. I went with root. On Ubuntu you first need to enable root login over SSH:

sudo nano /etc/ssh/sshd_config
PermitRootLogin yes

After changing sshd_config, restart SSH and set a root password:

sudo service ssh restart
sudo passwd root

Next, I added all the servers to the hosts file on the first machine, which I wanted to use for the installation. You can use either DNS or the hosts file for this; in my small lab environment I went with the hosts file.

nano /etc/hosts

Add the following lines:

127.0.1.1 hadoop01
192.168.0.162 hadoop01
192.168.0.163 hadoop02
192.168.0.164 hadoop03
192.168.0.165 hadoop04
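
Optionally, you can push the same hosts file to the other nodes so they can resolve each other as well. A minimal sketch, assuming root SSH login is enabled on every machine (you will be prompted for the password until the keys are in place):

# Copy the hosts file to the remaining nodes (prompts for the root password per host)
for h in hadoop02 hadoop03 hadoop04; do
  scp /etc/hosts root@$h:/etc/hosts
done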

Next, we need to be able to access all machines from the first node. Create an SSH key for this first:

ssh-keygen

Now copy the public key into the authorized_keys file on every node:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/id_rsa.pub | ssh root@hadoop02 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh root@hadoop03 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh root@hadoop04 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
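
As a quick check that the passwordless login works, the following should print each hostname without asking for a password:

# Each node should answer with its hostname, without a password prompt
for h in hadoop02 hadoop03 hadoop04; do ssh root@$h hostname; done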

Hortonworks requires transparent huge pages (THP) to be disabled on each server. Disable the setting as follows:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
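
You can verify that the change took effect with the command below; "never" should now be shown between square brackets. Note that this setting does not survive a reboot, so you may want to add the echo line to /etc/rc.local as well.

# Should print something like: always madvise [never]
cat /sys/kernel/mm/transparent_hugepage/enabled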

We also need NTP on every machine. Install it with:

sudo apt-get install ntp
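
To check that the NTP daemon is running and has found servers to synchronise with, you can use:

# Show the daemon status and the list of peers it is talking to
sudo service ntp status
ntpq -p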

Now we’re ready to start the installation. Add the Hortonworks Ambari repository and import its key:

wget -nv http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/2.2.2.0/ambari.list -O /etc/apt/sources.list.d/ambari.list
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD

Update the package lists and install the Ambari server:

apt-get update
apt-get install ambari-server

Now you can run the initial setup. Use the command below; this is how I answered the questions:

ambari-server setup

Customize user account for ambari-server daemon [y/n] -> n
Checking JDK… -> [1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
Do you accept the Oracle Binary Code License Agreement [y/n] -> y
Enter advanced database configuration [y/n] (n)? -> n

Accept the Oracle JDK license when prompted. You must accept this license to download the necessary JDK from Oracle. The JDK is installed during the deploy phase.

Select n at "Enter advanced database configuration" to use the default, embedded PostgreSQL database for Ambari. The default database name is ambari and the default username/password is ambari/bigdata. If you want to use an existing PostgreSQL, MySQL or Oracle database with Ambari, answer y instead.

Now you are ready to start the server. Use the following command:

ambari-server start

Navigate to port 8080 on the Ambari server. In my case that was http://192.168.0.162:8080/

Log in with the default admin/admin combination, see below:

[Screenshot: horton01]

The next step is to launch the install wizard; use this button.

[Screenshot: horton02]

Give the cluster a name. In my case I used the name “hadoop”.

[Screenshot: horton03]

Select the distribution version. I used the latest, HDP 2.4.

[Screenshot: horton04]

Select the nodes you want to install on. I used all four nodes, see below. For the communication, the SSH private key has to be pasted into this screen. Use the command below and copy the entire key into the field:

cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAxD2NeUv21QX2240Nys+sWnZbG7oERcmADtfm3WWSR0Tdt6WF
If9TLyx/uFTnhOXkMe8SyEzae4zd81XZG8V8wDb5g/Ubdey4s+0ykiTpg4dKUTiT
moN5M3+Z34VjnDaSVhMH9Bi9wWNwYKagjaK/KSGsxtfUSK6Pu+5IPzBctjV5R0Yd
KoxuXAxetpsx8HeyCaJntXrl7HBvq4bqM3CxNRoXCQTYTrZq2i+hiOcOaK/ZytyX

[Screenshot: horton05]

Confirm and start the installation:

[Screenshot: horton06]

When ready, select the services you want to use.

[Screenshot: horton09]

The next step is to assign the masters. I used the first node for the NameNode, ZooKeeper, Atlas and Grafana; the next node got the SNameNode (secondary NameNode), History Server, App Timeline Server, and so on. You can also put everything on one server, but make sure you have enough memory.

[Screenshot: horton10]

The next step is to assign the slaves and clients. I made all hosts a DataNode and NodeManager. You might want to make an exception for the first node.

[Screenshot: horton11]

The next step is to create a new MySQL database for Hive. This is where Hive stores its metastore information.

[Screenshot: horton12]

I had to enter a password for Grafana in order to complete the installation:

[Screenshot: horton13]

Review and finish the installation:

[Screenshot: horton14]

All the packages will be deployed:

[Screenshot: horton15]

[Screenshot: horton16]

After the installation, the admin user was not able to use the HDFS client because it had no home directory in HDFS. To fix this, switch to the hdfs system account and create the directory:

su - hdfs
hadoop fs -mkdir /user/admin

Set the ownership on the newly created directory:

hadoop fs -chown admin:hadoop /user/admin
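
You can verify the result with:

# The /user/admin directory should now be listed with owner admin and group hadoop
hadoop fs -ls /user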

 

Synology data scrubbing speedup

My NAS sometimes needs to perform a parity consistency check (data scrubbing). To speed up this process, use the following command:

echo 190000 > /proc/sys/dev/raid/speed_limit_min

The process should now be about 10 times quicker!
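
To see the current limits and follow the progress of the check, you can use the standard Linux md RAID interfaces, which should also be available on the DiskStation:

# Show the minimum and maximum rebuild/check speed limits (in KB/s)
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# Watch the progress of the parity check
cat /proc/mdstat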

 

[WordPress] filter out unneeded menu classes

Here’s a code snippet for WordPress to filter out unneeded menu classes. Put the following snippet in your functions.php:

// Reduce nav classes, keeping only the ones in the whitelist below
function nav_class_filter($var)
{
    return is_array($var) ? array_intersect($var, array(
        'menu',
        'menu-main',
        'menu-primary',
        'menu-item',
        'sub-menu',
        'menu-last-item',
        'menu-first-item',
        'menu-noparent',
        'menu-parent',
        'menu-top',
        'current-menu-item'
    )) : '';
}
add_filter('nav_menu_css_class', 'nav_class_filter', 100, 1);
add_filter('nav_menu_item_id', 'nav_class_filter', 100, 1); // also strips the per-item IDs
 

[WordPress] set cookie

Here’s a code snippet for WordPress. Since WordPress doesn’t use PHP sessions by default, a cookie can be useful. Here’s a snippet for your functions.php:

//Set cookie
function set_newuser_cookie() {
	if (!isset($_COOKIE['sitename_newvisitor'])) {
		setcookie('sitename_newvisitor', 1, time()+1209600, COOKIEPATH, COOKIE_DOMAIN, false);
	}
}
add_action( 'init', 'set_newuser_cookie');
 

[WordPress & Genesis] Add parent and child classes to menu

Here’s another code snippet for WordPress Genesis Framework.

If you would like to add menu classes to parent and child menu items, use the code below and register the walker, e.g. by passing 'walker' => new Arrow_Walker_Nav_Menu() in your wp_nav_menu() arguments.
Put this in the functions.php file:


// Walker that adds parent and child classes to menu items
class Arrow_Walker_Nav_Menu extends Walker_Nav_Menu
{
    function display_element($element, &$children_elements, $max_depth, $depth, $args, &$output)
    {
        $id_field = $this->db_fields['id'];
        if (0 == $depth) {
            $element->classes[] = 'menu-top'; // top-level menu item
            if (empty($children_elements[$element->$id_field])) {
                $element->classes[] = 'menu-noparent'; // item has no children
            }
        }
        if (!empty($children_elements[$element->$id_field])) {
            $element->classes[] = 'menu-parent'; // item has children
        }
        parent::display_element($element, $children_elements, $max_depth, $depth, $args, $output);
    }
}
 

[WordPress & Genesis] Add menu classes to first and last menu items

Here’s a code snippet for WordPress Genesis Framework.

If you would like to add menu classes to the first and last menu items, use the code below. Put this in the functions.php file:

// Function to add menu classes to the first and last menu items
function add_first_and_last($items)
{
    $items[1]->classes[]             = 'menu-first-item';
    $items[count($items)]->classes[] = 'menu-last-item';
    return $items;
}
add_filter('wp_nav_menu_objects', 'add_first_and_last');
 

[WordPress & Genesis] custom viewport

Here’s a code snippet for WordPress Genesis Framework.

If you would like to use a custom viewport for mobile devices, for example, use the code below:

/** Add Viewport meta tag for mobile browsers */
add_action('genesis_meta', 'add_viewport_meta_tag');
function add_viewport_meta_tag()
{
    echo '<meta name="viewport" content="width=1020">';
}
 

[WordPress & Genesis] custom footer or custom header

Here’s a code snippet for WordPress Genesis. If you would like to use a custom header or custom footer, don’t use a header.php or footer.php. Use the code snippet below:

// Include header, separate from functions
include 'custom-header.php';

// Include footer, separate from functions
include 'custom-footer.php';
 

[Synology] How to secure photostation with htaccess

Here’s a short instruction on how to protect your Synology Photo Station using htaccess:

Create the following file: /volume1/@appstore/PhotoStation/photo/.htaccess

AuthName "Restricted Area"
AuthType Basic
AuthUserFile /volume1/@appstore/PhotoStation/photo/.htpasswd
AuthGroupFile /dev/null
require valid-user

and the following file: /volume1/@appstore/PhotoStation/photo/.htpasswd

admin:xxxxxxxxxxxxxxxx

Use an online htpasswd generator to create the hash for your own password.
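
Alternatively, if the openssl binary is available on the DiskStation, you can generate the entry on the command line. A sketch: -apr1 produces an Apache-compatible MD5 hash, and openssl will prompt for the password:

# Append an "admin" entry with a freshly generated hash to the .htpasswd file
printf "admin:%s\n" "$(openssl passwd -apr1)" >> /volume1/@appstore/PhotoStation/photo/.htpasswd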

 

Synology: Monitoring Apache with mod_status

Here’s another quick how-to. If you would like to monitor your Apache web server, you can do that with mod_status. Open the httpd.conf-user file:

pico /usr/syno/apache/conf/httpd.conf-user

Copy and paste the following content into it:

<Location /server-status>
   SetHandler server-status
   Order Deny,Allow
   Deny from all
   Allow from all
</Location>

Save the httpd.conf-user file and restart apache with:

/usr/syno/etc/rc.d/S97apache-user.sh restart

You can now obtain the Apache server status by querying the following URL:

http://diskstation/server-status?auto
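
From another machine on your network (or from the NAS itself, if curl is installed) you can also poll it on the command line, for example:

# Fetch the machine-readable status page
curl http://diskstation/server-status?auto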

This might be useful for people working with Cacti.

 

Synology: Run sabnzbd behind apache

Here’s a quick instruction for those who would like to run SABnzbd behind Apache on a Synology NAS. Open an SSH connection and create the following file:

nano /usr/syno/etc/sites-enabled-user/sabnzbd.conf

Copy and paste the contents below into this file. Please note that my SABnzbd port is 9090; if yours differs, adjust the port in the config below.

# Put this after the other LoadModule directives
LoadModule proxy_module /usr/syno/apache/modules/mod_proxy.so
LoadModule proxy_http_module /usr/syno/apache/modules/mod_proxy_http.so

<Location /sabnzbd>
order deny,allow
deny from all
allow from all
ProxyPass http://localhost:9090/sabnzbd
ProxyPassReverse http://localhost:9090/sabnzbd
</Location>

Save the file (Ctrl+X in nano) and restart Apache with the following command:

/usr/syno/etc/rc.d/S97apache-user.sh restart
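
Assuming the NAS answers to the name diskstation on your network and curl is available, you can quickly check that the proxy works with:

# A 200 OK (or a redirect from SABnzbd) means the proxy is forwarding requests
curl -I http://diskstation/sabnzbd/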
 

Restore / Extract Plesk 9.5.4 backup

If you want to do a manual restore of your Plesk 9.5.4 backup, you should use the following command:

cat plesk-backup_1205270308.tar* | tar xvf -

The domains, vhosts and databases are found in the clients, domains and resellers folders.
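
If you first want to see what is in the backup without extracting it, you can list the contents:

# List the archive contents; pipe through less to page the output
cat plesk-backup_1205270308.tar* | tar tvf - | less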

 

updatedb on Synology

If you would like to index your Synology file system, you can use the mlocate package from Optware.

Install the mlocate package with the following command:

ipkg install mlocate

To update the index, use:

updatedb

To search, use the locate command.
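
For example, with a hypothetical search term:

# Search the index for any path containing "holiday"
locate holiday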

 

Environment variable HADOOP_CMD must be set before loading package rhdfs

If, on Ubuntu, R unexpectedly gives you the following error:

Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
  call: fun(libname, pkgname)
  error: Environment variable HADOOP_CMD must be set before loading package rhdfs

Then try adding the following line to the /etc/environment file:

HADOOP_CMD="/usr/local/hadoop/bin/hadoop"

Hopefully this solves the problem above!
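
Alternatively, as a quick test you can export the variable in the shell before starting R (same path assumption as above):

# Export for the current shell session only, then start R
export HADOOP_CMD="/usr/local/hadoop/bin/hadoop"
R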

 

Tuning MapReduce Jobs

If you want to tune MapReduce, it is useful to keep an eye on a number of parameters. The following parameters can be important:

mapred.tasktracker.map.tasks.maximum = The maximum number of map tasks that will be run simultaneously by a task tracker.
mapred.tasktracker.reduce.tasks.maximum = The maximum number of reduce tasks that will be run simultaneously by a task tracker.
mapred.reduce.tasks = The default number of reduce tasks per job.
mapred.map.tasks = The default number of map tasks per job. Ignored when mapred.job.tracker is “local”.

These can all be configured in the mapred-site.xml file. Here is an example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>node1:54311</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>10</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
</property>
</configuration>

For a quick test you can, for example, let Hadoop estimate pi (the second command cleans up the temporary directory afterwards):

hadoop jar /usr/local/hadoop/hadoop-examples-1.0.3.jar pi 10 10
hadoop dfs -rmr /user/hduser/PiEstimator_TMP_3_141592654

More information about tuning can be found here: Pro Hadoop Ch. 6