Get Started with Apache Hadoop on Rackspace Cloud
Disclaimer: This article details a process intended for
educational purposes only. This does not deploy a production
environment.
What is Apache Hadoop?
Hadoop is an open source project that provides a platform to store and
process massive amounts of data. Hadoop uses the Map Reduce paradigm to
split large tasks into many smaller chunks and executes them in
parallel. Each of these tasks are executed close to the data in the
Hadoop Distributed File System.
Hadoop use cases
In a very short time, Hadoop has revolutionized almost every business
sector. Actual use cases involving Hadoop include the following scenarios:
- Analyzing medical data.
- Analyzing transaction data to detect anomalies and suggest
fraudulent behavior. - Processing high definition images from satellites and detecting
patterns of geographical change. - Processing machine-generated data to identify malware and cyber
attack patterns.
Objective
This article is used for educational purposes only and provides you with
an example of how to get started with Apache Hadoop in the cloud. It shows you
how to launch a Hadoop cluster starting with two nodes and to grow it up to
64 nodes. The article uses Primary instead of Master to describe the main node.
The process includes the following steps:
- Create Cloud Servers through the Cloud Control Panel.
- Create Cloud Servers by using scripts.
- Install and configure Hadoop.
- Run Map Reduce applications.
Prerequisites
The following prerequisites are expected for successful completion:
- Rackspace Cloud account (https://cart.rackspace.com/cloud/).
- SSH client (Windows users can download PUTTY:
https://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). - Basic knowledge of Linux.
- Basic knowledge of Hadoop.
Hadoop installation process
It can be complicated to manually install and configure Hadoop,
so here are a few tools to make the installation easier. In
particular, the following two projects are useful:
- Chef recipes for Hadoop
(https://github.rackspace.com/CloudBigData/cookbook-hadoop-conf) - Knife plugin for the Rackspace API
(https://github.com/opscode/knife-rackspace)
Using these tools, the article demonstrates how to create a Hadoop installation
for the following scenarios:
- 1 Cloud Server as workstation.
- 1 Cloud Server as Hadoop Primary node.
- 1 Cloud Server as Hadoop Worker node.
- Gradually add up to 63 more Hadoop Worker nodes.
The following sections break up each part of the build into separate tasks.
Set up the server as a workstation
This section builds the workstation that is the launching
point to build the remainder of the Hadoop environment.
Create a cloud server
Log in to the Cloud Control Panel
and create a Cloud Server using a Linux® image. Record the IP address
and password for the server.
Wait for your server to be in an Available
state. This server will be used
to create other servers. To begin, SSH into it and run the following commands:
export RACKSPACE_API_USERNAME=<Your Rackspace Cloud account username>
export RACKSPACE_API_KEY=<Your API Key>
export RACKSPACE_VERSION=v2
export RACKSPACE_ENDPOINT=https://dfw.servers.api.rackspacecloud.com/v2
curl -L https://raw.github.com/sacharya/random-scripts/master/knife-rackspace-hadoop/chef-knife-install.sh | bash
Note: for information about how to find your API key, see View and reset your API key.
This installs the Chef server, installs knife-rackspace plugin, uploads the
chef hdp-cookbooks, and configures them to talk to Rackspace Cloud using
your account. You can now use the knife client to interact with Rackspace
Cloud and configure your Hadoop cluster.
Choose the image
You need a CentOS 6.2 image as the base image for the server to install
Hadoop.
IMAGE_ID=`knife rackspace image list | grep 'CentOS 6.2' | awk '{print $1}'`
echo $IMAGE_ID
Choose the flavor
Use a flavor with 4096 MB of RAM for the server.
FLAVOR_ID=`knife rackspace flavor list | grep '4096' | awk '{print $1}'`
echo $FLAVOR_ID
Create your environment
In order not to conflict with other Hadoop clusters within the same
account, create a Chef environment called YourName
to create your
Hadoop cluster on. Save this name in an environment variable so
you can reference it later.
ENV_NAME=<YourName>
echo $ENV_NAME
Now run the following commands to setup the environment within Chef.
cp /root/hdp-cookbooks/environments/example.json /root/hdp-cookbooks/environments/$ENV_NAME.json
sed -i "s/example/$ENV_NAME/g" /root/hdp-cookbooks/environments/$ENV_NAME.json
knife environment from file /root/hdp-cookbooks/environments/$ENV_NAME.json
Create a Hadoop primary
This command creates a cloud server with the name,
YourName-hadoopprimary with CentOS 6.2 and 4 GB RAM.
It creates it in the example environment and gives it a role of
hadoop-master. Chef then installs and configures all the components
required to make it a Hadoop Primary node.
knife rackspace server create --server-name $ENV_NAME-hadoopprimary --image $IMAGE_ID --flavor $FLAVOR_ID --environment $ENV_NAME --run-list 'role[hadoop-master]'
Now, copy the hadoopprimary's public IP and password from the output.
Save the IP address in an environment variable to use later.
HADOOP_P_IP=<Hadoop Primary IP>
echo $HADOOP_P_IP
Run the following commands:
Note: Ideally, you shouldn't have to run the code below, but there
is currently a bug in the hdp-cookbooks where the hostname is not
propagated properly. So you have to run this extra step.
ssh root@$HADOOP_P_IP "chef-client && /etc/init.d/hadoop-namenode restart && /etc/init.d/hadoop-jobtracker restart"
Verify that the primary is up by going to the jobtracker at:
https://<Hadoop Primary IP>:50030
Create a Hadoop worker
From your server workstation, execute the following command to create a
Hadoop worker node:
knife rackspace server create --server-name $ENV_NAME-hadoopworker1 --image $IMAGE_ID --flavor $FLAVOR_ID --environment $ENV_NAME --run-list 'role[hadoop-worker]'
Similarly, copy the hadoopworker1's public IP and password at the end.
Save the hadoop worker IP address in an environment variable to
use later.
HADOOP_W1_IP=<Hadoop Worker 1 IP Address>
echo $HADOOP_W1_IP
Run the following command:
ssh root@$HADOOP_W1_IP "chef-client && /etc/init.d/hadoop-datanode restart && /etc/init.d/hadoop-tasktracker restart"
Verify that the worker is running by going to the jobtracker at:
https://<Hadoop Primary IP>:50030
Run a map reduce application
Now, SSH to the HadoopPrimary node that you created previously and run the
following examples:
ssh root@$HADOOP_P_IP
hadoop jar /usr/lib/hadoop/hadoop-examples-1.0.3.15.jar pi 10 1000000
This task runs a simulation to estimate the value of pi based on
sampling.
curl -L "https://raw.github.com/sacharya/random-scripts/master/knife-rackspace-hadoop/wordcount.sh" | bash
This script downloads all of Shakespeare's books from Project Gutenberg, uploads
them to HDFS, and runs a Map Reduce operation run a word count against the text.
Add more nodes
So far, you have created only hadoopworker1. Keep adding more
HadoopWorker nodes by following the same process. Make sure to increment
the hadoopworker number each time. Run and benchmark your application
and see how it performs when the size of the cluster grows.
After you feel comfortable, you can also play with different flavor sizes
and see what works best for your application.
Delete the cluster
If you are done with your computation, you might want to delete the
cluster to free up the resources. To do this, you need the server id of
the server you want to delete.
knife rackspace server list
knife rackspace server delete `knife rackspace server list | grep $HADOOP_P_IP | awk '{print $1}'`
knife rackspace server delete `knife rackspace server list | grep $HADOOP_W1_IP | awk '{print $1}'`
Repeat the process for all the servers in the cluster by replacing
$HADOOP_W1_IP
with the IP for the appropriate worker number.
Summary
In this article, you learned how to interact with the cloud using
tools and scripts. You also saw how to get started with Apache
Hadoop on a couple of cloud servers and scale it up with your needs.
Updated 4 months ago