Welcome!

AJAX & REA Authors: Elizabeth White, Plutora Blog, Liz McMillan, Carmen Gonzalez, Sematext Blog

Related Topics: Cloud Expo, Open Source

Cloud Expo: Blog Post

Hadoop Quickstart: Create and Better Manage Hadoop Clusters on Rackspace

Use Whirr to automate standup of your distributed cluster on Rackspace

We have previously provided a Quickstart guide to standing up Rackspace cloud servers (and have one for Amazon servers as well). These are very low cost ways of building reliable, production ready capabilities for enterprise use (commercial and government).  And Bryan Halfpap has provided a Quickstart guide which shows you how to build a Hadoop Cluster (leveraging Cloudera’s CDH3).  Using Bryan’s guide you can have a Hadoop Cluster up and running in under 20 minutes.

With this post we would like to provide you with some additional tips that flow from these other posts. We will show you how to build clusters even faster using another common tool in community use, Whirr.

What is Whirr? Apache Whirr is a set of libraries for running cloud services. Here is more from http://whirr.apache.org/

Whirr provides:

  • A cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.
  • A common service API. The details of provisioning are particular to the service.
  • Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.

And the great news is you can use Whirr as a command line tool for deploying clusters.

If you follow the tips below you can use Whirr to quickly standup distributed clusters. Our assumptions in this guide are that you have stood up RedHat severs using our Rackspace tutorial. But if this is not the case you should be able to easily modify the tips below to suit your situation.

SSH into your Rackspace account by terminal window:

sudo ssh [email protected]

After logging in, it is always a good idea to make sure you have the latest packages. In Red Hat, type:

sudo yum upgrade

Now it is time to install Whirr.  This is easy since you are running RedHat. RedHat uses YUM, a package management application that makes software installation easy. Type:

yum install whirr

Your installation will be complete in under a minute.

You will now need to generate a keypair for use with Whirr. This will let you enable secure communications with the Whirr cluster without needing passwords. To do that, enter the following command:

ssh-keygen -t rsa -P ”

You will see:

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):

Just hit “enter”.

You will see something like:

Created directory ‘/root/.ssh’.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
c6:31:f7:f5:97:e4:8c:b3:2a:f4:0d:a0:93:e4:c1:06 root@RedHat6-1-Hadoop-Test
The key’s randomart image is:
+–[ RSA 2048]—-+
| |
| E |
| o o . .. |
| * = . .=..|
| + S . .o +o|
| * . . o .|
| o . o. |
| . … |
| .. |
+—————–+

Now you must define your Whirr cluster. you do that by creating a properties file. For simplicity, you will name it hadoop.properties. You will need your rackspace username and API to fill out the whirr properties file.  Your API is found in my account page under “API Access”

You can create the properties file many ways. Here is how to do it in nano:

nano hadoop.properties

Now  enter the following info in that file, subsituting your login and API info for what you see below:

whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+$
whirr.provider=cloudservers-us
whirr.identity=enteryourloginidfromrackspace
whirr.credential=[youuseyourownapi]
whirr.private-key-file=/root/.ssh/id_rsa
whirr.public-key-file=/root/.ssh/id_rsa.pub
whirr.cluster-user=newusers
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop

Now to launch a cluster, type:

$ whirr launch-cluster –config hadoop.properties

This will take a few moments to run. As it runs you should see messages like:

Bootstrapping cluster
Configuring template
Starting 1 node(s) with roles [hadoop-datanode]
Configuring template
Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]

As things are started up, servers are being automatically built. Keep watching your e-mail, you will be getting notices of server standup. Remember, this is costing you money. When you finish using your clusters you will want to terminate them. You can do that through Whirr or by just nuking the servers using your Rackspace account and control panel.

Note the info being provided in the terminal window. Information is being provided on the instances being stood up.  As you skim this info you will notice a couple URL’s are provided that give you a web UI into the namenode and job tracker. For example, mine are:

Namenode web UI available at http://50.56.211.206:50070

Jobtracker web UI available at http://50.56.211.206:50030

You will also see that a site file was created for you at:

/root/.whirr/myhadoopcluster/hadoop-site.xml

You need to update your your local Hadoop configuration to use this file.  Type the following commands:

cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
alternatives –install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
alternatives –display hadoop-0.20-conf

A proxy script was created for you at:

/root/.whirr/myhadoopcluster/hadoop-proxy.sh

You should now start that proxy.  It is there for security reasons.  All traffic from the network where your client is running is proxied through the master node of the cluster using an SSH tunnel.  This script launches the proxy.  Run the following command to launch the script:

~/.whirr/myhadoopcluster/hadoop-proxy.sh

If that doesn’t run make sure you have the right permissions on the file by

chmod +rwx hadoop-proxy.sh

Then try again.

With the above you are now able to use your Hadoop Cluster.

Prove that by browsing HDFS:

hadoop fs -ls /

Now it is time to run a MapReduce job!  We are going to use one of the example programs provided in the Hadoop installation. The program is in the file Hadoop-*examples*.jar  First, a lets review list of options available form the program. See these by entering:

hadoop jar $HADOOP_HOME/hadoop-examples-*.jar

You will see:

An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.

So lets put this info to use. We will make a directory put some info in there, and run the wordcount program:

 

$ export HADOOP_HOME=/usr/lib/hadoop
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
$ hadoop fs -cat output/part-* | head

Now you are off and running.

You now have a platform capable of scaling to very large jobs. And it runs CDH3, the most reliable, capable distribution of Hadoop and related technologies. Let the fun begin!

But one final note. Think about the lifecycle of your system. At some time you will need to spin it down and turn it off.  To destroy the cluster gracefully using Whirr, enter this command:

whirr destroy-cluster –config hadoop.properties

Using the information above you can create and better manage Hadoop Clusters on Rackspace very easily. This is how we create our CDH Clusters.  In future posts we will show you how to get data prepared for analysis and how to run some queries.  We will also provide tips on how to use Cloudera’s free management tools and how to upgrade to Cloudera Enterprise when you are ready.

 

Read the original blog entry...

More Stories By Bob Gourley

Bob Gourley, former CTO of the Defense Intelligence Agency (DIA), is Founder and CTO of Crucial Point LLC, a technology research and advisory firm providing fact based technology reviews in support of venture capital, private equity and emerging technology firms. He has extensive industry experience in intelligence and security and was awarded an intelligence community meritorious achievement award by AFCEA in 2008, and has also been recognized as an Infoworld Top 25 CTO and as one of the most fascinating communicators in Government IT by GovFresh.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@CloudExpo Stories
The move in recent years to cloud computing services and architectures has added significant pace to the application development and deployment environment. When enterprise IT can spin up large computing instances in just minutes, developers can also design and deploy in small time frames that were unimaginable a few years ago. The consequent move toward lean, agile, and fast development leads to the need for the development and operations sides to work very closely together. Thus, DevOps become...
DevOps is all about agility. However, you don't want to be on a high-speed bus to nowhere. The right DevOps approach controls velocity with a tight feedback loop that not only consists of operational data but also incorporates business context. With a business context in the decision making, the right business priorities are incorporated, which results in a higher value creation. In his session at DevOps Summit, Todd Rader, Solutions Architect at AppDynamics, discussed key monitoring techniques...
Advanced Persistent Threats (APTs) are increasing at an unprecedented rate. The threat landscape of today is drastically different than just a few years ago. Attacks are much more organized and sophisticated. They are harder to detect and even harder to anticipate. In the foreseeable future it's going to get a whole lot harder. Everything you know today will change. Keeping up with this changing landscape is already a daunting task. Your organization needs to use the latest tools, methods and ex...
Building low-cost wearable devices can enhance the quality of our lives. In his session at Internet of @ThingsExpo, Sai Yamanoor, Embedded Software Engineer at Altschool, provided an example of putting together a small keychain within a $50 budget that educates the user about the air quality in their surroundings. He also provided examples such as building a wearable device that provides transit or recreational information. He then reviewed the resources available to build wearable devices at ...
The Internet of Things is not new. Historically, smart businesses have used its basic concept of leveraging data to drive better decision making and have capitalized on those insights to realize additional revenue opportunities. So, what has changed to make the Internet of Things one of the hottest topics in tech? In his session at @ThingsExpo, Chris Gray, Director, Embedded and Internet of Things, discussed the underlying factors that are driving the economics of intelligent systems. Discover ...
From telemedicine to smart cars, digital homes and industrial monitoring, the explosive growth of IoT has created exciting new business opportunities for real time calls and messaging. In his session at @ThingsExpo, Ivelin Ivanov, CEO and Co-Founder of Telestax, shared some of the new revenue sources that IoT created for Restcomm – the open source telephony platform from Telestax. Ivelin Ivanov is a technology entrepreneur who founded Mobicents, an Open Source VoIP Platform, to help create, de...
Mobile commerce traffic is surpassing desktop, yet less than 20% of sales in the U.S. are mobile commerce sales. In his session at 15th Cloud Expo, Dan Franklin, Segment Manager, Commerce, at Verizon Digital Media Services, defined mobile devices and discussed how next generation means simplification. It means taking your digital content and turning it into instantly gratifying experiences.
“DevOps is really about the business. The business is under pressure today, competitively in the marketplace to respond to the expectations of the customer. The business is driving IT and the problem is that IT isn't responding fast enough," explained Mark Levy, Senior Product Marketing Manager at Serena Software, in this SYS-CON.tv interview at DevOps Summit, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
IBM has announced software that allows people to hide or anonymize their personal information on the Web, ensuring protection from identity theft and other misuse. Developed by researchers at IBM's laboratory in Zurich, Switzerland, the software – called Identity Mixer – will enable consumers to purchase goods and services on the Internet without disclosing personal information. As consumers hand over personal details in exchange for downloading music or subscribing to online newsletters, they...
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core...
The Internet of Things will greatly expand the opportunities for data collection and new business models driven off of that data. In her session at @ThingsExpo, Esmeralda Swartz, CMO of MetraTech, discussed how for this to be effective you not only need to have infrastructure and operational models capable of utilizing this new phenomenon, but increasingly service providers will need to convince a skeptical public to participate. Get ready to show them the money!
Software AG and Wipro Ltd. have announced a joint solution platform for streaming analytics that provides real-time actionable intelligence for the Internet of Things (IoT) market. “The key to successfully addressing the IoT market is the ability to rapidly build and evolve apps that tap into, analyze and make smart decisions on fast, big data”, said John Bates, Global Head of Industry Solutions and CMO, Software AG. To address the huge market potential created by streaming analytics in conj...
Cloud services are the newest tool in the arsenal of IT products in the market today. These cloud services integrate process and tools. In order to use these products effectively, organizations must have a good understanding of themselves and their business requirements. In his session at 15th Cloud Expo, Brian Lewis, Principal Architect at Verizon Cloud, outlined key areas of organizational focus, and how to formalize an actionable plan when migrating applications and internal services to the ...
Today’s enterprise is being driven by disruptive competitive and human capital requirements to provide enterprise application access through not only desktops, but also mobile devices. To retrofit existing programs across all these devices using traditional programming methods is very costly and time consuming – often prohibitively so. In his session at @ThingsExpo, Jesse Shiah, CEO, President, and Co-Founder of AgilePoint Inc., discussed how you can create applications that run on all mobile ...
"SOASTA built the concept of cloud testing in 2008. It's grown from rather meager beginnings to where now we are provisioning hundreds of thousands of servers on a daily basis on behalf of customers around the world to test their applications," explained Tom Lounibos, CEO of SOASTA, in this SYS-CON.tv interview at DevOps Summit, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
"Verizon Digital Media Services is responsible for the broadcast, video and content delivery network that accelerates, scales and helps our customers reach end users with all kinds of video and web content," stated James Segil, CMO of Verizon Digital Media Services, in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Axios Systems on Tuesday announced it has selected CenturyLink Cloud as the hosting platform for Axios Systems’ IT Service Management (ITSM) solutions in Canada. CenturyLink, a global provider of communications and IT services, joins other leading technology providers across North America, Europe and APAC to help Axios’ international customers drive efficiencies and innovation across their service management provision. The arrangement with CenturyLink enables Axios to further strengthen its pres...
Docker is becoming very popular--we are seeing every major private and public cloud vendor racing to adopt it. It promises portability and interoperability, and is quickly becoming the currency of the Cloud. In his session at DevOps Summit, Bart Copeland, CEO of ActiveState, discussed why Docker is so important to the future of the cloud, but will also take a step back and show that Docker is actually only one piece of the puzzle. Copeland will outline the bigger picture of where Docker fits a...
The Internet of Things is a misnomer. That implies that everything is on the Internet, and that simply should not be - especially for things that are blurring the line between medical devices that stimulate like a pacemaker and quantified self-sensors like a pedometer or pulse tracker. The mesh of things that we manage must be segmented into zones of trust for sensing data, transmitting data, receiving command and control administrative changes, and peer-to-peer mesh messaging. In his session a...
The speed of product development has increased massively in the past 10 years. At the same time our formal secure development and SDL methodologies have fallen behind. This forces product developers to choose between rapid release times and security. In his session at DevOps Summit, Michael Murray, Director of Cyber Security Consulting and Assessment at GE Healthcare, examined the problems and presented some solutions for moving security into the DevOps lifecycle to ensure that we get fast AND ...