Get Started with Apache Hadoop on Rackspace Cloud

Disclaimer: This article details a process intended for
educational purposes only. This does not deploy a production
environment.

What is Apache Hadoop?

Hadoop is an open source project that provides a platform to store and
process massive amounts of data. Hadoop uses the Map Reduce paradigm to
split large tasks into many smaller chunks and executes them in
parallel. Each of these tasks is executed close to the data in the
Hadoop Distributed File System (HDFS).
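As a rough, single-machine analogy, the classic word-count example can be sketched as a shell pipeline: the "map" step emits one word per line, and the "reduce" step counts each word. The sketch below assumes a local text file named input.txt and is only an illustration of the idea; Hadoop runs the equivalent steps in parallel across the cluster.

# Local word-count sketch (not Hadoop): "map" emits one word per line,
# "reduce" counts the occurrences of each word.
tr -cs 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn | head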

Hadoop use cases

In a short time, Hadoop has been adopted across a wide range of business
sectors. Real-world use cases include the following scenarios:

  • Analyzing medical data.
  • Analyzing transaction data to detect anomalies and flag potentially
    fraudulent behavior.
  • Processing high definition images from satellites and detecting
    patterns of geographical change.
  • Processing machine-generated data to identify malware and cyber
    attack patterns.

Objective

This article is intended for educational purposes only and provides an
example of how to get started with Apache Hadoop in the cloud. It shows you
how to launch a Hadoop cluster that starts with two nodes and grows to as
many as 64 nodes. The article uses the term Primary instead of Master to
describe the main node. The process includes the following steps:

  1. Create Cloud Servers through the Cloud Control Panel.
  2. Create Cloud Servers by using scripts.
  3. Install and configure Hadoop.
  4. Run Map Reduce applications.

Prerequisites

The following prerequisites are expected for successful completion:

  • A Rackspace Cloud account, along with its account username and API key.
  • An SSH client that you can use to log in to the cloud servers that you
    create.

Hadoop installation process

Manually installing and configuring Hadoop can be complicated, so this
article uses a few tools to make the installation easier. In
particular, the following two projects are useful:

  • Chef, along with the knife-rackspace plugin, to create and bootstrap
    the cloud servers.
  • The hdp-cookbooks, a set of Chef cookbooks that install and configure
    the Hadoop components on those servers.

Using these tools, the article demonstrates how to create a Hadoop
installation with the following layout:

  • One Cloud Server as a workstation.
  • One Cloud Server as the Hadoop Primary node.
  • One Cloud Server as a Hadoop Worker node.
  • Up to 63 additional Hadoop Worker nodes, added gradually.

The following sections break up each part of the build into separate tasks.

Set up the server as a workstation

This section builds the workstation that serves as the launching
point for building the remainder of the Hadoop environment.

Create a cloud server

Log in to the Cloud Control Panel
and create a Cloud Server using a Linux® image. Record the IP address
and password for the server.

Wait for your server to be in an Available state. This server will be used
to create other servers. To begin, SSH into it and run the following commands:

export RACKSPACE_API_USERNAME=<Your Rackspace Cloud account username>
export RACKSPACE_API_KEY=<Your API Key>
export RACKSPACE_VERSION=v2
export RACKSPACE_ENDPOINT=https://dfw.servers.api.rackspacecloud.com/v2
curl -L https://raw.github.com/sacharya/random-scripts/master/knife-rackspace-hadoop/chef-knife-install.sh | bash

Note: For information about how to find your API key, see View and reset your API key.

This script installs the Chef server, installs the knife-rackspace plugin,
uploads the hdp-cookbooks to Chef, and configures them to talk to the
Rackspace Cloud by using your account. You can now use the knife client to
interact with the Rackspace Cloud and configure your Hadoop cluster.
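To confirm that knife can reach the Rackspace Cloud API with your credentials, you can run a couple of read-only listing commands. The flavor listing returns the available server sizes, and the server listing shows any servers already in the account.

knife rackspace flavor list
knife rackspace server list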

Choose the image

You need a CentOS 6.2 image as the base image for the servers on which you
install Hadoop.

IMAGE_ID=`knife rackspace image list | grep 'CentOS 6.2' | awk '{print $1}'`
echo $IMAGE_ID

Choose the flavor

Use a flavor with 4096 MB of RAM for the server.

FLAVOR_ID=`knife rackspace flavor list | grep '4096' | awk '{print $1}'`
echo $FLAVOR_ID

Create your environment

To avoid conflicts with other Hadoop clusters within the same
account, create a Chef environment called YourName in which to build your
Hadoop cluster. Save this name in an environment variable so that
you can reference it later.

ENV_NAME=<YourName>
echo $ENV_NAME

Now run the following commands to set up the environment within Chef.

cp /root/hdp-cookbooks/environments/example.json /root/hdp-cookbooks/environments/$ENV_NAME.json
sed -i "s/example/$ENV_NAME/g" /root/hdp-cookbooks/environments/$ENV_NAME.json
knife environment from file /root/hdp-cookbooks/environments/$ENV_NAME.json
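As an optional check, you can ask the Chef server to list its environments and show the one that you just uploaded:

knife environment list
knife environment show $ENV_NAME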

Create a Hadoop primary

The following command creates a cloud server named
YourName-hadoopprimary with CentOS 6.2 and 4 GB of RAM.

It creates the server in the $ENV_NAME environment and assigns it the
hadoop-master role. Chef then installs and configures all the components
required to make it a Hadoop Primary node.

knife rackspace server create --server-name $ENV_NAME-hadoopprimary --image $IMAGE_ID --flavor $FLAVOR_ID --environment $ENV_NAME --run-list 'role[hadoop-master]'

Now, copy hadoopprimary's public IP address and password from the output.
Save the IP address in an environment variable to use later.

HADOOP_P_IP=<Hadoop Primary IP>
echo $HADOOP_P_IP

Run the following command:

Note: Ideally, you should not have to run this command, but there is
currently a bug in the hdp-cookbooks where the hostname is not
propagated properly, so this extra step is required.

ssh root@$HADOOP_P_IP "chef-client && /etc/init.d/hadoop-namenode restart && /etc/init.d/hadoop-jobtracker restart"

Verify that the primary is up by browsing to the JobTracker UI at the
following URL:

http://<Hadoop Primary IP>:50030
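If you prefer to check from the workstation's command line instead of a browser, a quick HTTP request against the JobTracker port should return a 200 status code after the primary is up. This is only a reachability check, not a full health check:

curl -s -o /dev/null -w "HTTP %{http_code}\n" "http://$HADOOP_P_IP:50030/"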

Create a Hadoop worker

From your server workstation, execute the following command to create a
Hadoop worker node:

knife rackspace server create --server-name $ENV_NAME-hadoopworker1 --image $IMAGE_ID --flavor $FLAVOR_ID --environment $ENV_NAME --run-list 'role[hadoop-worker]'

Similarly, copy hadoopworker1's public IP address and password from the
output. Save the worker IP address in an environment variable to
use later.

HADOOP_W1_IP=<Hadoop Worker 1 IP Address>
echo $HADOOP_W1_IP

Run the following command:

ssh root@$HADOOP_W1_IP "chef-client && /etc/init.d/hadoop-datanode restart && /etc/init.d/hadoop-tasktracker restart"

Verify that the worker is running by browsing to the JobTracker UI at
the following URL:

http://<Hadoop Primary IP>:50030
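You can also confirm from the command line that the worker's DataNode has registered with HDFS. On the primary node, the dfsadmin report lists the live datanodes; after the first worker joins, it should report one. This assumes that the hadoop command installed by the cookbooks is on the path on the primary node:

ssh root@$HADOOP_P_IP "hadoop dfsadmin -report | head -n 20"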

Run a map reduce application

Now, SSH to the HadoopPrimary node that you created previously and run the
following examples:

ssh root@$HADOOP_P_IP
hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.3.15.jar pi 10 1000000

This task runs a simulation to estimate the value of pi based on
sampling.
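To get a feel for how the cluster handles a heavier job, you can rerun the same example with more map tasks and more samples per map, for example:

hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.3.15.jar pi 20 10000000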

curl -L "https://raw.github.com/sacharya/random-scripts/master/knife-rackspace-hadoop/wordcount.sh" | bash

This script downloads all of Shakespeare's books from Project Gutenberg, uploads
them to HDFS, and runs a Map Reduce job that performs a word count against the text.
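The word-count results land in HDFS. The exact output directory depends on the wordcount.sh script, so from the primary node list HDFS to find it before reading the results; the output path in the last command is only a placeholder:

# List HDFS to locate the output directory created by wordcount.sh,
# then view the top of the results (replace the placeholder path).
hadoop fs -ls /
hadoop fs -cat <wordcount output directory>/part-* | head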

Add more nodes

So far, you have created only hadoopworker1. Keep adding HadoopWorker
nodes by following the same process, incrementing the worker number each
time, as shown in the example below. Run and benchmark your application
to see how it performs as the size of the cluster grows.
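For example, the second worker follows the same pattern as the first, with the server name and variable incremented:

knife rackspace server create --server-name $ENV_NAME-hadoopworker2 --image $IMAGE_ID --flavor $FLAVOR_ID --environment $ENV_NAME --run-list 'role[hadoop-worker]'
HADOOP_W2_IP=<Hadoop Worker 2 IP Address>
ssh root@$HADOOP_W2_IP "chef-client && /etc/init.d/hadoop-datanode restart && /etc/init.d/hadoop-tasktracker restart"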

After you are comfortable with the process, you can also experiment with
different flavor sizes to see what works best for your application.

Delete the cluster

When you are done with your computation, you might want to delete the
cluster to free up resources. To do this, you need the server ID of
each server that you want to delete.

knife rackspace server list
knife rackspace server delete `knife rackspace server list | grep $HADOOP_P_IP | awk '{print $1}'`
knife rackspace server delete `knife rackspace server list | grep $HADOOP_W1_IP | awk '{print $1}'`

Repeat the process for all the servers in the cluster, replacing
$HADOOP_W1_IP with the IP address of each remaining worker, or use a loop
like the one shown below.
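Because every Hadoop node in this walkthrough is named with the $ENV_NAME prefix, you can also delete the whole cluster with a short loop. This is only a sketch: review the output of knife rackspace server list before deleting anything. The -y flag tells knife to skip the confirmation prompt; drop it if you prefer to confirm each deletion.

# Delete every server whose name starts with the environment prefix.
knife rackspace server list
for ID in `knife rackspace server list | grep "$ENV_NAME-" | awk '{print $1}'`; do
    knife rackspace server delete $ID -y
done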

Summary

In this article, you learned how to interact with the cloud by using
tools and scripts. You also saw how to get started with Apache
Hadoop on a couple of cloud servers and scale the cluster up as your
needs grow.