Posts Tagged “hadoop”

Parcel Not Distributing in Cloudera CDH.

We were deploying a cluster in our lab environment, which is shared by everyone, so the lab has its own share of stale information on it.

Written on March 8, 2017
linux centos redhat cloudera hadoop cluster


Creating /etc/hosts file in Chef.

We had a cluster environment in which we needed to update the /etc/hosts file so the servers could communicate over a private network. Our servers have multiple interfaces, and we need them to talk to each other over the private one.
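
The end result we want on every node is a hosts file along these lines (the addresses and hostnames below are illustrative placeholders, not our actual layout):

```
# /etc/hosts fragment -- private-network addresses for cluster traffic
10.0.0.11  node1.cluster.internal  node1
10.0.0.12  node2.cluster.internal  node2
10.0.0.13  node3.cluster.internal  node3
```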

Written on March 3, 2017
linux centos redhat chef hadoop cluster


Enable Kerberos Using Cloudera API.

The Python API for Cloudera is really nice: apart from setting up the cluster, we can also use it for configuration and automation. We do a lot of automation with Chef/Ansible, but the Cloudera API gives finer control over the cluster.

Written on February 26, 2017
linux cloudera hadoop cloudera-api kerberos


Setting Up HDFS Services Using Cloudera API [Part 3]

This is the second follow-up post in the series. In the earlier post, Setting Up Zookeeper Services Using Cloudera API [Part 2], we installed the Zookeeper service; now we will set up HDFS services on the cluster.

Written on February 8, 2017
linux cloudera hadoop cloudera-api zookeeper hdfs


Setting Up Zookeeper Services Using Cloudera API [Part 2]

This is the first follow-up post. In the earlier post, Setting Up Cloudera Manager Services Using Cloudera API [Part 1], we installed the Cloudera Management services. Now we will install the Zookeeper service on the cluster.

Written on February 2, 2017
linux cloudera hadoop cloudera-api zookeeper


Setting Up Cloudera Manager Services Using Cloudera API [Part 1]

The Cloudera API is a very convenient way to set up a cluster and do more.

Written on January 25, 2017
linux cloudera hadoop cloudera-api


Getting Started with Cloudera API

These are the basic steps to connect to Cloudera Manager.

Written on January 14, 2017
linux cloudera hadoop cloudera-api


Basic Testing On Hadoop Environment [Cloudera]

This is a set of basic tests we can run on a Hadoop environment to make sure it is set up correctly.

Written on January 13, 2017
linux cloudera hadoop testing


Setting Hue to Listen on `0.0.0.0` [Cloudera]

We were setting up a cluster, and the Hue URL was bound to a private IP of the server, since we had set up all the nodes to reach each other over private IPs. But we wanted Hue to bind to the public interface so that it could be accessed from within the network.
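
As a sketch, the binding is controlled in hue.ini (in a Cloudera Manager deployment this override goes into the Hue configuration safety valve); setting http_host to 0.0.0.0 makes Hue listen on all interfaces. The port shown is Hue's usual default:

```ini
[desktop]
# Bind to all interfaces instead of the private IP
http_host=0.0.0.0
http_port=8888
```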

Written on January 12, 2017
linux cloudera hadoop


YUM Repository Creation on HTTPD Web Server.

Setting up YUM repos on RHEL using httpd: we will set up httpd and build a YUM repo on top of it, so that the repo can be accessed over HTTP.
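
Once httpd is serving a directory indexed with createrepo, a client points at it with a .repo file. A minimal sketch, where the repo id, hostname, and path are all placeholders:

```ini
# /etc/yum.repos.d/local.repo -- id, name, and URL are placeholders
[local-repo]
name=Local HTTPD-hosted Repository
baseurl=http://repo.example.com/repo/
enabled=1
gpgcheck=0
```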

Written on November 3, 2015
linux hadoop webserver http httpd yum rhel centos


Access Filter Setup with SSSD

If using access_provider = ldap, this option is mandatory. It specifies an LDAP search filter criteria that must be met for the user to be granted access on this host. If access_provider = ldap and this option is not set, it will result in all users being denied access. Use access_provider = allow to change this default behaviour.
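
A minimal sssd.conf sketch of the above, assuming a hypothetical domain and an LDAP group used as the gate; the domain name, group DN, and base DN are all examples:

```ini
# /etc/sssd/sssd.conf (fragment)
[domain/example.com]
access_provider = ldap
# Only members of this group are granted access on this host
ldap_access_filter = memberOf=cn=hadoop-admins,ou=Groups,dc=example,dc=com
```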

Written on October 23, 2015
linux hadoop sssd access-filter rhel centos security


Update Cloudera Manager to a Specific Version [5.4.5]

Upgrading Cloudera Manager 5 to the latest Cloudera Manager. In most cases it is possible to complete the following upgrade without shutting down most CDH services, although you may need to stop some dependent services. CDH daemons can continue running, unaffected, while Cloudera Manager is upgraded. The upgrade process does not affect your CDH installation. After upgrading Cloudera Manager you may also want to upgrade CDH 5 clusters to CDH 5.4.5 or later.

Written on October 22, 2015
linux hadoop upgrade cloudera manager


Getting started with Hive with Kerberos.

Apache Hive is a powerful data warehousing application built on top of Hadoop; it enables you to access your data using Hive QL, a language that is similar to SQL. Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster. If Kerberos authentication is used, authentication is supported between the Thrift client and HiveServer2, and between HiveServer2 and secure HDFS.
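
For reference, a Kerberized HiveServer2 connection from the Thrift client typically carries the server principal in the JDBC URL. A sketch, where the hostname and realm are placeholders:

```
beeline -u "jdbc:hive2://hs2.example.com:10000/default;principal=hive/hs2.example.com@EXAMPLE.COM"
```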

Written on October 20, 2015
linux hadoop hive kerberos ad ldap cloudera security


Red Hat Integration with Active Directory using SSSD.

There are inherent structural differences between how Windows and Linux handle system users. The user schemas used in Active Directory and standard LDAPv3 directory services also differ significantly. When using an Active Directory identity provider with SSSD to manage system users, it is necessary to reconcile Active Directory-style users to the new SSSD users. There are two ways to achieve it:

Written on October 6, 2015
linux hadoop sssd active-directory ad ldap rhel centos security


Simple Steps to Integrate RHEL with Active Directory using SSSD.

There are inherent structural differences between how Windows and Linux handle system users. The user schemas used in Active Directory and standard LDAPv3 directory services also differ significantly. When using an Active Directory identity provider with SSSD to manage system users, it is necessary to reconcile Active Directory-style users to the new SSSD users. There are two ways to achieve it:

Written on October 6, 2015
linux hadoop sssd ad active-directory rhel centos security


Setting up Pentaho Data Integration 5.4.1 with Hadoop Cluster (Cloudera Manager)

Pentaho Data Integration (PDI) is a suite of open source Business Intelligence (BI) products which provide data integration, OLAP services, reporting, dashboarding, data mining and ETL capabilities. Here we set up the Pentaho server using the steps below. We will refer to Pentaho Data Integration as PDI from now on.

Written on October 6, 2015
linux hadoop pentaho data-integration cloudera manager


No valid credentials provided Mechanism level Failed to find any Kerberos tgt

We have two domain forests in our environment, ABC and XYZ, and we were not able to authenticate normal users from either domain. Most of the information is on the Cloudera website; you might want to check there first if you see anything similar.

Written on October 6, 2015
hadoop kerberos error krb5


Interface Forwarding - from `eth1` to `eth0` on `EDGE` node.

Adding a route on all the slaves, which reside on a private network, so they can communicate with an external server directly through an EDGE node using interface forwarding.

Written on October 6, 2015
linux hadoop interface network


Installing `ansible` on RHEL 6.6.

Ansible is a radically simple IT automation engine that automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs. For more detail, hop over to docs.ansible.com.

Written on October 6, 2015
linux ansible hadoop rhel centos


Ansible Playbook - Setup Zookeeper Using `tarball`.

This is a simple Zookeeper playbook to quickly get Zookeeper running on one or more nodes in clustered mode.
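
A hypothetical top-level playbook sketch of the idea; the host group, role name, and version variable are placeholders, not the playbook from the post:

```yaml
# site.yml (sketch) -- applies a zookeeper role to a group of nodes
- hosts: zookeeper_nodes
  become: yes
  vars:
    zookeeper_version: 3.4.6
  roles:
    - zookeeper
```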

Written on June 16, 2015
linux ansible hadoop zookeeper


Ansible Playbook - Setup Storm Cluster.

This is a simple Storm Cluster Setup. We are using a dedicated Zookeeper Cluster/Node, instead of the standalone zkserver. Below is how we will deploy our cluster.

Written on June 16, 2015
linux ansible storm hadoop zookeeper


Ansible Playbook - Setup Kafka Cluster.

This is a simple Kafka setup. In this setup we run Kafka over a dedicated Zookeeper service (NOT the standalone Zookeeper which comes with Kafka).

Written on June 16, 2015
linux ansible kafka hadoop zookeeper


Ansible Playbook - Setup Hadoop CDH5 Using `tarball`.

Setting up Hadoop using Ansible; we will use the CDH5 tarball to install the cluster.

Written on June 16, 2015
linux ansible hadoop tarball


Creating a Multi-node Cassandra Cluster on Centos 6.5.

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Written on March 19, 2015
hadoop linux cassandra centos rhel


Performance Tuning HBase and Hadoop.

Using HBase in production often requires that you turn many knobs to make it hum as expected. More here: http://hbase.apache.org/0.94/book/performance.html

Written on February 13, 2015
linux hadoop hbase performance-tuning


Setting up Your HBase Cluster on Hadoop (YARN), Ubuntu 12.04 LTS

HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).

Written on February 13, 2015
hadoop hbase linux ubuntu yarn


Installing KAFKA Single Node - Quick Start.

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. What does all that mean?

Written on February 6, 2015
linux hadoop kafka quick-start


Hadoop `sysctl.conf` parameters.

Performance-tuning Hadoop at the kernel level. sysctl is an interface that allows you to make changes to a running Linux kernel. With /etc/sysctl.conf you can configure various Linux networking and system settings such as:

  • Limit network-transmitted configuration for IPv4
  • Limit network-transmitted configuration for IPv6
  • Turn on execshield protection
  • Protect against the common SYN flood attack
  • Turn on source IP address verification
  • Prevent a cracker from using a spoofing attack against the IP address of the server
  • Log several types of suspicious packets, such as spoofed packets, source-routed packets, and redirects

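
The items above map to sysctl keys roughly as follows; the values shown are illustrative hardening defaults, not tuned recommendations (the exec-shield key is a RHEL 5/6-era feature):

```
# /etc/sysctl.conf (fragment) -- apply with `sysctl -p`
# Turn on execshield protection (RHEL-specific)
kernel.exec-shield = 1
kernel.randomize_va_space = 1
# Protect against SYN flood attacks
net.ipv4.tcp_syncookies = 1
# Turn on source IP address verification (anti-spoofing)
net.ipv4.conf.all.rp_filter = 1
# Log suspicious (martian) packets: spoofed, source-routed, redirects
net.ipv4.conf.all.log_martians = 1
# Ignore source-routed packets and ICMP redirects (IPv4 and IPv6)
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
```
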
Written on January 28, 2015
hadoop linux kernel sysctl performance tuning


How To Configure Swappiness

Swappiness is a Linux kernel parameter that controls the relative weight given to swapping out runtime memory, as opposed to dropping pages from the system page cache. Swappiness can be set to values between 0 and 100 inclusive. A low value causes the kernel to avoid swapping, a higher value causes the kernel to try to use swap space. The default value is 60, and for most desktop systems, setting it to 100 may affect the overall performance, whereas setting it lower (even 0) may decrease response latency.
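
A quick sketch of checking and changing the parameter; the value 10 is only an illustrative low setting, not a recommendation from the post:

```
# Check the current value
#   cat /proc/sys/vm/swappiness
# Set at runtime (not persistent)
#   sysctl vm.swappiness=10
# Persist across reboots via /etc/sysctl.conf
vm.swappiness = 10
```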

Written on January 20, 2015
hadoop linux swappiness swap sysctl config


Setting up Local HBase on Top of HDFS.

First, let's set up the HBase configuration files. For pseudo-distributed mode, replace the hostname with 'localhost'.
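
A minimal hbase-site.xml sketch for running HBase on top of HDFS; the NameNode port and the single-node Zookeeper quorum are assumptions for the pseudo-distributed case:

```xml
<!-- hbase-site.xml (sketch) -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>
```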

Written on January 1, 2015
hadoop hdfs hbase hadoop-config hbase-config


Enable Authorization on HBase.

Add the below properties to all the Masters and Region Servers.
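
As a sketch of what the hbase-site.xml change typically looks like, turning on authorization and loading the AccessController coprocessor (shown here without the Kerberos-related token settings the full post may add):

```xml
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
```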

Written on December 31, 2014
hadoop hbase hadoop-config hbase-config security authentication


Zabbix Hadoop Monitoring.

This script can be used to monitor Namenode parameters, generate Zabbix import XML, or send monitoring data to a Zabbix server.

Written on November 25, 2014
zabbix csv-processing linux mib python hadoop xml nagios monitoring