By Bob Gourley
March 5, 2012 03:30 AM EST
We have previously provided a Quickstart guide to standing up Rackspace cloud servers (and have one for Amazon servers as well). These are very low-cost ways of building reliable, production-ready capabilities for enterprise use (commercial and government). Bryan Halfpap has also provided a Quickstart guide which shows you how to build a Hadoop Cluster (leveraging Cloudera’s CDH3). Using Bryan’s guide, you can have a Hadoop Cluster up and running in under 20 minutes.
With this post we would like to provide you with some additional tips that flow from these other posts. We will show you how to build clusters even faster using another common tool in community use, Whirr.
What is Whirr? Apache Whirr is a set of libraries for running cloud services. Here is more from http://whirr.apache.org/
- A cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.
- A common service API. The details of provisioning are particular to the service.
- Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
And the great news is you can use Whirr as a command line tool for deploying clusters.
If you follow the tips below you can use Whirr to quickly stand up distributed clusters. This guide assumes that you have stood up Red Hat servers using our Rackspace tutorial. If that is not the case, you should be able to easily modify the tips below to suit your situation.
SSH into your Rackspace account by terminal window:
sudo ssh [email protected]
After logging in, it is always a good idea to make sure you have the latest packages. In Red Hat, type:
sudo yum upgrade
Now it is time to install Whirr. This is easy since you are running Red Hat, which uses YUM, a package management application that makes software installation easy. Type:
yum install whirr
Your installation will be complete in under a minute.
You will now need to generate a keypair for use with Whirr. This will let you enable secure communications with the Whirr cluster without needing passwords. To do that, enter the following command:
ssh-keygen -t rsa -P ''
You will see:
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Just hit “enter”.
You will see something like:
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
The key’s randomart image is:
+--[ RSA 2048]----+
| E |
| o o . .. |
| * = . .=..|
| + S . .o +o|
| * . . o .|
| o . o. |
| . ... |
| .. |
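If you want to script this step rather than answer the prompt by hand, a minimal sketch along the following lines may help. It assumes the default key path shown in the output above; the function name `ensure_keypair` is made up for illustration. It skips generation when a key already exists, so an existing key is not overwritten.

```shell
#!/bin/sh
# Hypothetical helper: create an RSA keypair with an empty passphrase,
# but only if the private key file does not exist yet.
ensure_keypair() {
    keyfile="${1:-$HOME/.ssh/id_rsa}"
    if [ -f "$keyfile" ]; then
        echo "Key already exists: $keyfile"
        return 0
    fi
    mkdir -p "$(dirname "$keyfile")"
    # -P '' sets an empty passphrase; -f names the file so there is no prompt.
    ssh-keygen -t rsa -P '' -f "$keyfile"
}
```

Calling `ensure_keypair` with no argument uses `$HOME/.ssh/id_rsa`; pass a path to use a different location.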
Now you must define your Whirr cluster. You do that by creating a properties file. For simplicity, name it hadoop.properties. You will need your Rackspace username and API key to fill out the Whirr properties file. Your API key is found on your account page under “API Access”.
You can create the properties file many ways. Here is how to do it in nano:
Now enter the following info in that file, substituting your login and API info for what you see below:
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+$
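The login and API settings can also be appended from the shell. The sketch below is only an illustration: the function name `append_whirr_credentials` is hypothetical, the key paths assume the keypair generated earlier, and the provider id must be taken from the documentation for your Whirr version. It leaves any cluster name and instance-templates lines already in the file untouched.

```shell
#!/bin/sh
# Hypothetical helper: append provider, login, and key settings to a Whirr
# properties file. Arguments: file, provider id, username, API key.
append_whirr_credentials() {
    file="$1"
    provider="$2"
    identity="$3"
    credential="$4"
    {
        printf 'whirr.provider=%s\n' "$provider"
        printf 'whirr.identity=%s\n' "$identity"
        printf 'whirr.credential=%s\n' "$credential"
        # Assumes the default key location used by ssh-keygen.
        printf 'whirr.private-key-file=%s\n' "$HOME/.ssh/id_rsa"
        printf 'whirr.public-key-file=%s\n' "$HOME/.ssh/id_rsa.pub"
    } >> "$file"
    # The file now holds an API key, so restrict it to the owner.
    chmod 600 "$file"
}
```

A call might look like `append_whirr_credentials hadoop.properties cloudservers-us YOUR_USERNAME YOUR_API_KEY`, where `cloudservers-us` is an assumed provider id for Rackspace and the last two values are placeholders.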
Now to launch a cluster, type:
$ whirr launch-cluster --config hadoop.properties
This will take a few moments to run. As it runs you should see messages like:
Starting 1 node(s) with roles [hadoop-datanode]
Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]
As the cluster starts up, servers are being built automatically. Keep watching your e-mail; you will get notices as each server is stood up. Remember, this is costing you money. When you finish using your clusters you will want to terminate them. You can do that through Whirr or by just nuking the servers using your Rackspace account and control panel.
Note the info being provided in the terminal window about the instances being stood up. As you skim this info you will notice a couple of URLs that give you a web UI into the namenode and job tracker. For example, mine are:
Namenode web UI available at http://188.8.131.52:50070
Jobtracker web UI available at http://184.108.40.206:50030
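The launch output scrolls by quickly, so it can be handy to save it and pull the URLs out afterwards. The sketch below assumes you captured the output yourself, for example by piping the launch command through `tee whirr-launch.log` (a file name chosen here for illustration), and that the lines look exactly like the two shown above. The function name `list_web_uis` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<service> <url>" for every web UI line
# found in a saved copy of the launch output.
list_web_uis() {
    # The first field is the service name, the last field is the URL.
    awk '/web UI available at/ { print $1, $NF }' "$1"
}
```

Run it as `list_web_uis whirr-launch.log` to get one service and URL per line.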
You will also see that a site file was created for you at:
You need to update your local Hadoop configuration to use this file. Type the following commands:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
alternatives --display hadoop-0.20-conf
A proxy script was created for you at:
You should now start that proxy. It is there for security reasons. All traffic from the network where your client is running is proxied through the master node of the cluster using an SSH tunnel. This script launches the proxy. Run the following command to launch the script:
If that doesn’t run, make sure you have the right permissions on the file by typing:
chmod +rwx hadoop-proxy.sh
Then try again.
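Because the proxy holds the terminal while it runs, you may prefer to start it in the background. The following is a rough sketch, not part of Whirr itself: the function name `start_proxy`, the `PROXY_PID` variable, and the default log path are all made up for illustration, and you pass in the path to the proxy script that was created for you.

```shell
#!/bin/sh
# Hypothetical helper: make the proxy script executable, start it in the
# background, and remember its process ID in PROXY_PID.
start_proxy() {
    script="$1"
    log="${2:-/tmp/hadoop-proxy.log}"
    if [ ! -f "$script" ]; then
        echo "Proxy script not found: $script" >&2
        return 1
    fi
    chmod u+x "$script"
    # Send output to a log file so the proxy runs quietly in the background.
    "$script" > "$log" 2>&1 &
    PROXY_PID=$!
    echo "Proxy started with PID $PROXY_PID (log: $log)"
}
```

After calling `start_proxy` with the script path, `kill -0 "$PROXY_PID"` reports whether the script is still running, and the log file shows any connection errors.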
With the above you are now able to use your Hadoop Cluster.
Prove that by browsing HDFS:
hadoop fs -ls /
Now it is time to run a MapReduce job! We are going to use one of the example programs provided in the Hadoop installation. The programs are in the file hadoop-*examples*.jar. First, let’s review the list of options available from the program. See these by entering:
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
You will see:
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
So let’s put this info to use. We will make a directory, put some info in there, and run the wordcount program:
$ export HADOOP_HOME=/usr/lib/hadoop
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
$ hadoop fs -cat output/part-* | head
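The last command shows only the first few lines of output, in key order. To see the most frequent words instead, the output can be sorted by count. This is a small sketch that relies on the wordcount output being one word and one count per line, separated by a tab; the function name `top_words` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "word<TAB>count" lines on standard input and
# print the N most frequent words (default 10), highest count first.
top_words() {
    sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}
```

For example, `hadoop fs -cat output/part-* | top_words 5` prints the five most frequent words in the input.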
Now you are off and running.
You now have a platform capable of scaling to very large jobs. And it runs CDH3, the most reliable, capable distribution of Hadoop and related technologies. Let the fun begin!
But one final note. Think about the lifecycle of your system. At some time you will need to spin it down and turn it off. To destroy the cluster gracefully using Whirr, enter this command:
whirr destroy-cluster --config hadoop.properties
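Since a forgotten cluster keeps costing money, it can help to tie launch and teardown together. The sketch below is one possible approach, not something Whirr provides: the names `run`, `with_cluster`, `DRY_RUN`, and `CONFIG` are all hypothetical. By default it only prints the commands it would run, so nothing is launched until you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch of a launch/teardown wrapper. With DRY_RUN=1 (the default here)
# commands are only printed, so nothing is launched or billed.
DRY_RUN="${DRY_RUN:-1}"
CONFIG="${CONFIG:-hadoop.properties}"

# Print a command, and execute it unless this is a dry run.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Launch the cluster, run the given command, then always destroy the cluster.
with_cluster() {
    status=0
    if run whirr launch-cluster --config "$CONFIG"; then
        run "$@" || status=$?
    else
        status=$?
    fi
    # Always attempt teardown so servers do not keep accruing charges.
    run whirr destroy-cluster --config "$CONFIG"
    return "$status"
}
```

For example, `with_cluster hadoop fs -ls /` prints the launch, the listing command, and the destroy command in that order; with `DRY_RUN=0` it actually runs them. The wrapper returns the exit status of the launch or of your command, whichever failed.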
Using the information above you can create and better manage Hadoop Clusters on Rackspace very easily. This is how we create our CDH Clusters. In future posts we will show you how to get data prepared for analysis and how to run some queries. We will also provide tips on how to use Cloudera’s free management tools and how to upgrade to Cloudera Enterprise when you are ready.