|By Bob Gourley||
|March 5, 2012 03:30 AM EST||
We have previously provided a Quickstart guide to standing up Rackspace cloud servers (and have one for Amazon servers as well). These are very low cost ways of building reliable, production ready capabilities for enterprise use (commercial and government). And Bryan Halfpap has provided a Quickstart guide which shows you how to build a Hadoop Cluster (leveraging Cloudera’s CDH3). Using Bryan’s guide you can have a Hadoop Cluster up and running in under 20 minutes.
With this post we would like to provide you with some additional tips that flow from these other posts. We will show you how to build clusters even faster using another common tool in community use, Whirr.
What is Whirr? Apache Whirr is a set of libraries for running cloud services. Here is more from http://whirr.apache.org/
- A cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.
- A common service API. The details of provisioning are particular to the service.
- Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
And the great news is you can use Whirr as a command line tool for deploying clusters.
If you follow the tips below you can use Whirr to quickly standup distributed clusters. Our assumptions in this guide are that you have stood up RedHat severs using our Rackspace tutorial. But if this is not the case you should be able to easily modify the tips below to suit your situation.
SSH into your Rackspace account by terminal window:
sudo ssh [email protected]
After logging in, it is always a good idea to make sure you have the latest packages. In Red Hat, type:
sudo yum upgrade
Now it is time to install Whirr. This is easy since you are running RedHat. RedHat uses YUM, a package management application that makes software installation easy. Type:
yum install whirr
Your installation will be complete in under a minute.
You will now need to generate a keypair for use with Whirr. This will let you enable secure communications with the Whirr cluster without needing passwords. To do that, enter the following command:
ssh-keygen -t rsa -P ”
You will see:
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Just hit “enter”.
You will see something like:
Created directory ‘/root/.ssh’.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
The key’s randomart image is:
+–[ RSA 2048]—-+
| E |
| o o . .. |
| * = . .=..|
| + S . .o +o|
| * . . o .|
| o . o. |
| . … |
| .. |
Now you must define your Whirr cluster. you do that by creating a properties file. For simplicity, you will name it hadoop.properties. You will need your rackspace username and API to fill out the whirr properties file. Your API is found in my account page under “API Access”
You can create the properties file many ways. Here is how to do it in nano:
Now enter the following info in that file, subsituting your login and API info for what you see below:
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+$
Now to launch a cluster, type:
$ whirr launch-cluster –config hadoop.properties
This will take a few moments to run. As it runs you should see messages like:
Starting 1 node(s) with roles [hadoop-datanode]
Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]
As things are started up, servers are being automatically built. Keep watching your e-mail, you will be getting notices of server standup. Remember, this is costing you money. When you finish using your clusters you will want to terminate them. You can do that through Whirr or by just nuking the servers using your Rackspace account and control panel.
Note the info being provided in the terminal window. Information is being provided on the instances being stood up. As you skim this info you will notice a couple URL’s are provided that give you a web UI into the namenode and job tracker. For example, mine are:
Namenode web UI available at http://126.96.36.199:50070
Jobtracker web UI available at http://188.8.131.52:50030
You will also see that a site file was created for you at:
You need to update your your local Hadoop configuration to use this file. Type the following commands:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirrrm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
alternatives –install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
alternatives –display hadoop-0.20-conf
A proxy script was created for you at:
You should now start that proxy. It is there for security reasons. All traffic from the network where your client is running is proxied through the master node of the cluster using an SSH tunnel. This script launches the proxy. Run the following command to launch the script:
If that doesn’t run make sure you have the right permissions on the file by
chmod +rwx hadoop-proxy.sh
Then try again.
With the above you are now able to use your Hadoop Cluster.
Prove that by browsing HDFS:
hadoop fs -ls /
Now it is time to run a MapReduce job! We are going to use one of the example programs provided in the Hadoop installation. The program is in the file Hadoop-*examples*.jar First, a lets review list of options available form the program. See these by entering:
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
You will see:
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
So lets put this info to use. We will make a directory put some info in there, and run the wordcount program:
$ export HADOOP_HOME=/usr/lib/hadoop$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
$ hadoop fs -cat output/part-* | head
Now you are off and running.
You now have a platform capable of scaling to very large jobs. And it runs CDH3, the most reliable, capable distribution of Hadoop and related technologies. Let the fun begin!
But one final note. Think about the lifecycle of your system. At some time you will need to spin it down and turn it off. To destroy the cluster gracefully using Whirr, enter this command:
whirr destroy-cluster –config hadoop.properties
Using the information above you can create and better manage Hadoop Clusters on Rackspace very easily. This is how we create our CDH Clusters. In future posts we will show you how to get data prepared for analysis and how to run some queries. We will also provide tips on how to use Cloudera’s free management tools and how to upgrade to Cloudera Enterprise when you are ready.
- Mainstream Business Applications and In-Memory Databases
- IoT: I Don't Care How Big It Is!
- In 2014 Big Data Investments Will Account for Nearly $30 Billion - Eventually Accounting for $76 Billion by 2020 End
- Apple iOS and the Enterprise - Latest Developments
- Power Panel: IoT Poineers Discuss Internet Of Things
- "Cloud Computing 2.0" -- I'm Serious
- Rackspace Positioned in the Leaders Quadrant of Gartner's Magic Quadrant for Cloud-Enabled Managed Hosting in Both North America and Europe
- Microsoft Surface Pro 3 Review
- AppDynamics Closes $120 Million in Growth Financing to Accelerate Platform and Sales Expansion
- Understanding Application Performance on the Network | Part 1
- Understanding Application Performance on the Network | Part 3
- @ThingsExpo Debuts Internet Of Things Newsletter
- Mainstream Business Applications and In-Memory Databases
- The Odd Couple: Marrying Agile and Waterfall
- The Butterfly Effect Within IT
- Findly Enhances Recruiting Efficiency With New Single Sign-on Portal
- Zuora Caps Record Breaking Subscribed 2014 with the 63rd Release of the Award Winning Z-Business Platform
- The Agile PMO
- Flexera Software Increases Commitment to Europe with Germany-based Datacenter to Host FlexNet Manager Suite Cloud for European Customers
- IoT: I Don't Care How Big It Is!
- Complete Surface Pro 3 Review - 3 days later
- CORRECTING and REPLACING Android and the Open Automotive Alliance Shift into the Next Gear
- How to Monitor Swift/iOS8 Applications for Crashes and Performance Issues
- More on OnLive: New Cloud Solution Delivers Secure Cross-Platform Deployment for Graphics Intensive Applications
- Building a Drag-and-Drop Shopping Cart with AJAX
- What Is AJAX?
- Google Maps! AJAX-Style Web Development Using ASP.NET
- Where Are RIA Technologies Headed in 2008?
- Dolphin Announces Open API With Over 50 Add-ons Including Dropbox and Wikipedia
- How and Why AJAX, Not Java, Became the Favored Technology for Rich Internet Applications
- Flashback to January 2006: Exclusive SYS-CON.TV Interviews on "OpenAjax Alliance" Announcement
- "Real-World AJAX" One-Day Seminar Arrives in Silicon Valley
- AJAXWorld Conference & Expo to Take Place October 2-4, 2006, at the Santa Clara Convention Center, California
- AJAX Sponsor Webcasts Are Now Available at AJAXWorld Website
- AJAXWorld University Announces AJAX Developer Bootcamp
- i-Technology 2008 Predictions: Where's RIAs, AJAX, SOA and Virtualization Headed in 2008?
The 16th International Cloud Expo announces that its Call for Papers is now open. 16th International Cloud Expo, to be held June 9–11, 2015, at the Javits Center in New York City brings together Cloud Computing, APM, APIs, Security, Big Data, Internet of Things, DevOps and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal today!
Sep. 1, 2014 01:45 PM EDT Reads: 1,006
14th International Cloud Expo, held on June 10–12, 2014 at the Javits Center in New York City, featured three content-packed days with a rich array of sessions about the business and technical value of cloud computing, Internet of Things, Big Data, and DevOps led by exceptional speakers from every sector of the IT ecosystem. The Cloud Expo series is the fastest-growing Enterprise IT event in the past 10 years, devoted to every aspect of delivering massively scalable enterprise IT as a service.
Aug. 29, 2014 11:00 PM EDT Reads: 2,173
Hardware will never be more valuable than on the day it hits your loading dock. Each day new servers are not deployed to production the business is losing money. While Moore’s Law is typically cited to explain the exponential density growth of chips, a critical consequence of this is rapid depreciation of servers. The hardware for clustered systems (e.g., Hadoop, OpenStack) tends to be significant capital expenses. In his session at 15th Cloud Expo, Mason Katz, CTO and co-founder of StackIQ, to discuss how infrastructure teams should be aware of the capitalization and depreciation model of these expenses to fully understand when and where automation is critical.
Aug. 27, 2014 02:30 PM EDT Reads: 1,840
Over the last few years the healthcare ecosystem has revolved around innovations in Electronic Health Record (HER) based systems. This evolution has helped us achieve much desired interoperability. Now the focus is shifting to other equally important aspects – scalability and performance. While applying cloud computing environments to the EHR systems, a special consideration needs to be given to the cloud enablement of Veterans Health Information Systems and Technology Architecture (VistA), i.e., the largest single medical system in the United States.
Aug. 26, 2014 12:00 PM EDT Reads: 2,438
In his session at 15th Cloud Expo, Mark Hinkle, Senior Director, Open Source Solutions at Citrix Systems Inc., will provide overview of the open source software that can be used to deploy and manage a cloud computing environment. He will include information on storage, networking(e.g., OpenDaylight) and compute virtualization (Xen, KVM, LXC) and the orchestration(Apache CloudStack, OpenStack) of the three to build their own cloud services. Speaker Bio: Mark Hinkle is the Senior Director, Open Source Solutions, at Citrix Systems Inc. He joined Citrix as a result of their July 2011 acquisition of Cloud.com where he was their Vice President of Community. He is currently responsible for Citrix open source efforts around the open source cloud computing platform, Apache CloudStack and the Xen Hypervisor. Previously he was the VP of Community at Zenoss Inc., a producer of the open source application, server, and network management software, where he grew the Zenoss Core project to over 10...
Aug. 25, 2014 07:00 PM EDT Reads: 2,427
Most of today’s hardware manufacturers are building servers with at least one SATA Port, but not every systems engineer utilizes them. This is considered a loss in the game of maximizing potential storage space in a fixed unit. The SATADOM Series was created by Innodisk as a high-performance, small form factor boot drive with low power consumption to be plugged into the unused SATA port on your server board as an alternative to hard drive or USB boot-up. Built for 1U systems, this powerful device is smaller than a one dollar coin, and frees up otherwise dead space on your motherboard. To meet the requirements of tomorrow’s cloud hardware, Innodisk invested internal R&D resources to develop our SATA III series of products. The SATA III SATADOM boasts 500/180MBs R/W Speeds respectively, or double R/W Speed of SATA II products.
Aug. 25, 2014 06:00 PM EDT Reads: 7,319
As more applications and services move "to the cloud" (public or on-premise) cloud environments are increasingly adopting and building out traditional enterprise features. This in turn is enabling and encouraging cloud adoption from enterprise users. In many ways the definition is blurring as features like continuous operation, geo-distribution or on-demand capacity become the norm. NuoDB is involved in both building enterprise software and using enterprise cloud capabilities. In his session at 15th Cloud Expo, Seth Proctor, CTO at NuoDB, Inc., will discuss the experiences from building, deploying and using enterprise services and suggest some ways to approach moving enterprise applications into a cloud model.
Aug. 20, 2014 06:45 PM EDT Reads: 2,438
Until recently, many organizations required specialized departments to perform mapping and geospatial analysis, and they used Esri on-premise solutions for that work. In his session at 15th Cloud Expo, Dave Peters, author of the Esri Press book Building a GIS, System Architecture Design Strategies for Managers, will discuss how Esri has successfully included the cloud as a fully integrated SaaS expansion of the ArcGIS mapping platform. Organizations that have incorporated Esri cloud-based applications and content within their business models are reaping huge benefits by directly leveraging cloud-based mapping and analysis capabilities within their existing enterprise investments. The ArcGIS mapping platform includes cloud-based content management and information resources to more widely, efficiently, and affordably deliver real-time actionable information and analysis capabilities to your organization.
Aug. 20, 2014 10:00 AM EDT Reads: 1,729
Almost everyone sees the potential of Internet of Things but how can businesses truly unlock that potential. The key will be in the ability to discover business insight in the midst of an ocean of Big Data generated from billions of embedded devices via Systems of Discover. Businesses will also need to ensure that they can sustain that insight by leveraging the cloud for global reach, scale and elasticity. In his session at Internet of @ThingsExpo, Mac Devine, Distinguished Engineer at IBM, will discuss bringing these three elements together via Systems of Discover.
Aug. 17, 2014 02:30 PM EDT Reads: 3,312
Cloud and Big Data present unique dilemmas: embracing the benefits of these new technologies while maintaining the security of your organization’s assets. When an outside party owns, controls and manages your infrastructure and computational resources, how can you be assured that sensitive data remains private and secure? How do you best protect data in mixed use cloud and big data infrastructure sets? Can you still satisfy the full range of reporting, compliance and regulatory requirements? In his session at 15th Cloud Expo, Derek Tumulak, Vice President of Product Management at Vormetric, will discuss how to address data security in cloud and Big Data environments so that your organization isn’t next week’s data breach headline.
Aug. 16, 2014 07:00 PM EDT Reads: 1,898
The cloud is everywhere and growing, and with it SaaS has become an accepted means for software delivery. SaaS is more than just a technology, it is a thriving business model estimated to be worth around $53 billion dollars by 2015, according to IDC. The question is – how do you build and scale a profitable SaaS business model? In his session at 15th Cloud Expo, Jason Cumberland, Vice President, SaaS Solutions at Dimension Data, will give the audience an understanding of common mistakes businesses make when transitioning to SaaS; how to avoid them; and how to build a profitable and scalable SaaS business.
Aug. 16, 2014 01:00 PM EDT Reads: 2,457
SYS-CON Events announced today that Gridstore™, the leader in software-defined storage (SDS) purpose-built for Windows Servers and Hyper-V, will exhibit at SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Gridstore™ is the leader in software-defined storage purpose built for virtualization that is designed to accelerate applications in virtualized environments. Using its patented Server-Side Virtual Controller™ Technology (SVCT) to eliminate the I/O blender effect and accelerate applications Gridstore delivers vmOptimized™ Storage that self-optimizes to each application or VM across both virtual and physical environments. Leveraging a grid architecture, Gridstore delivers the first end-to-end storage QoS to ensure the most important App or VM performance is never compromised. The storage grid, that uses Gridstore’s performance optimized nodes or capacity optimized nodes, starts with as few a...
Aug. 15, 2014 06:30 PM EDT Reads: 1,719
SYS-CON Events announced today that Solgenia, the global market leader in Cloud Collaboration and Cloud Infrastructure software solutions, will exhibit at SYS-CON's 15th International Cloud Expo®, which will take place on November 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA. Solgenia is the global market leader in Cloud Collaboration and Cloud Infrastructure software solutions. Designed to “Bridge the Gap” between personal and professional social, mobile and cloud user experiences, our solutions help large and medium-sized organizations dramatically improve productivity, reduce collaboration costs, and increase the overall enterprise value by bringing collaboration and infrastructure solutions to the cloud.
Aug. 15, 2014 02:00 PM EDT Reads: 1,716
Cloud computing started a technology revolution; now DevOps is driving that revolution forward. By enabling new approaches to service delivery, cloud and DevOps together are delivering even greater speed, agility, and efficiency. No wonder leading innovators are adopting DevOps and cloud together! In his session at DevOps Summit, Andi Mann, Vice President of Strategic Solutions at CA Technologies, will explore the synergies in these two approaches, with practical tips, techniques, research data, war stories, case studies, and recommendations.
Aug. 13, 2014 09:45 PM EDT Reads: 2,586
Enterprises require the performance, agility and on-demand access of the public cloud, and the management, security and compatibility of the private cloud. The solution? In his session at 15th Cloud Expo, Simone Brunozzi, VP and Chief Technologist(global role) for VMware, will explore how to unlock the power of the hybrid cloud and the steps to get there. He'll discuss the challenges that conventional approaches to both public and private cloud computing, and outline the tough decisions that must be made to accelerate the journey to the hybrid cloud. As part of the transition, an Infrastructure-as-a-Service model will enable enterprise IT to build services beyond their data center while owning what gets moved, when to move it, and for how long. IT can then move forward on what matters most to the organization that it supports – availability, agility and efficiency.
Aug. 12, 2014 10:30 PM EDT Reads: 1,873