By Bob Gourley
March 5, 2012 03:30 AM EST
We have previously provided a Quickstart guide to standing up Rackspace cloud servers (and have one for Amazon servers as well). These are very low-cost ways of building reliable, production-ready capabilities for enterprise use (commercial and government). Bryan Halfpap has also provided a Quickstart guide that shows you how to build a Hadoop Cluster (leveraging Cloudera’s CDH3). Using Bryan’s guide you can have a Hadoop Cluster up and running in under 20 minutes.
With this post we would like to provide you with some additional tips that flow from these other posts. We will show you how to build clusters even faster using another common tool in community use, Whirr.
What is Whirr? Apache Whirr is a set of libraries for running cloud services. Here is more from http://whirr.apache.org/
- A cloud-neutral way to run services. You don’t have to worry about the idiosyncrasies of each provider.
- A common service API. The details of provisioning are particular to the service.
- Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
And the great news is you can use Whirr as a command line tool for deploying clusters.
If you follow the tips below you can use Whirr to quickly stand up distributed clusters. This guide assumes that you have stood up Red Hat servers using our Rackspace tutorial. If that is not the case, you should be able to easily modify the tips below to suit your situation.
SSH into your Rackspace server from a terminal window:
sudo ssh [email protected]
After logging in, it is always a good idea to make sure you have the latest packages. In Red Hat, type:
sudo yum upgrade
Now it is time to install Whirr. This is easy since you are running Red Hat, which uses YUM, a package management application that makes software installation easy. Type:
yum install whirr
Your installation will be complete in under a minute.
You will now need to generate a keypair for use with Whirr. This will let you enable secure communications with the Whirr cluster without needing passwords. To do that, enter the following command:
ssh-keygen -t rsa -P ''
You will see:
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Just hit “enter”.
You will see something like:
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
c6:31:f7:f5:97:e4:8c:b3:2a:f4:0d:a0:93:e4:c1:06 [email protected]
The key’s randomart image is:
+--[ RSA 2048]----+
| E |
| o o . .. |
| * = . .=..|
| + S . .o +o|
| * . . o .|
| o . o. |
| . ... |
| .. |
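If you script this step, you can avoid the interactive prompt by naming the key file explicitly with -f. The following is a minimal sketch, assuming the default key location used above. The helper name ensure_keypair is ours and is not part of Whirr or OpenSSH.

```shell
# Create a passphrase-less RSA keypair only if one does not already exist.
# ensure_keypair is a hypothetical helper name; the key path is an argument.
ensure_keypair() {
    key_file="$1"
    if [ -f "$key_file" ]; then
        echo "Key already exists: $key_file"
        return 0
    fi
    key_dir="$(dirname "$key_file")"
    mkdir -p "$key_dir"
    chmod 700 "$key_dir"
    # -P '' sets an empty passphrase; -f names the file so no prompt appears.
    # -q suppresses the fingerprint and randomart output.
    ssh-keygen -q -t rsa -P '' -f "$key_file"
}

# Example: ensure_keypair "$HOME/.ssh/id_rsa"
```

Running it a second time leaves an existing key untouched, which matters because overwriting the key would lock you out of clusters launched with the old one.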
Now you must define your Whirr cluster. You do that by creating a properties file. For simplicity, name it hadoop.properties. You will need your Rackspace username and API key to fill out the Whirr properties file. Your API key is found on your account page under “API Access”.
You can create the properties file many ways. Here is how to do it in nano:
nano hadoop.properties
Now enter the following info in that file, substituting your login and API info for what you see below:
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+$
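The line above runs past the edge of the editor window (the trailing $ is most likely nano’s marker for a line that continues off-screen). As a rough sketch only, a Whirr properties file for a Hadoop cluster generally names the cluster, the node roles, the cloud provider, your credentials, and your key files. Several values below are assumptions rather than the original file: the hadoop-tasktracker role in the second group, the cloudservers-us provider id (the Rackspace provider id has differed between Whirr releases, so check the documentation for your version), and the placeholder credentials. The cluster name myhadoopcluster matches the directory used later in this guide, and the helper name write_whirr_properties is hypothetical.

```shell
# Write a hypothetical hadoop.properties file.
# Arguments: output file, Rackspace username, Rackspace API key.
# The provider id and the tasktracker role are assumptions; verify them
# against the Whirr documentation for your installed version.
write_whirr_properties() {
    out_file="$1"
    rackspace_user="$2"
    rackspace_api_key="$3"
    # \$ keeps ${sys:user.home} literal so Whirr, not the shell, expands it.
    cat > "$out_file" <<EOF
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=cloudservers-us
whirr.identity=$rackspace_user
whirr.credential=$rackspace_api_key
whirr.private-key-file=\${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=\${sys:user.home}/.ssh/id_rsa.pub
EOF
}

# Example: write_whirr_properties hadoop.properties YOUR_USERNAME YOUR_API_KEY
```

Because the file holds your API key, consider restricting it with chmod 600 hadoop.properties.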
Now to launch a cluster, type:
$ whirr launch-cluster --config hadoop.properties
This will take a few moments to run. As it runs you should see messages like:
Starting 1 node(s) with roles [hadoop-datanode]
Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]
As things start up, servers are being built automatically. Keep watching your e-mail; you will get notices as each server is stood up. Remember, this is costing you money. When you finish using your clusters you will want to terminate them. You can do that through Whirr or by just nuking the servers using your Rackspace account and control panel.
Note the info being provided in the terminal window about the instances being stood up. As you skim it you will notice a couple of URLs that give you a web UI into the namenode and job tracker. For example, mine are:
Namenode web UI available at http://18.104.22.168:50070
Jobtracker web UI available at http://22.214.171.124:50030
You will also see that a site file was created for you at:
~/.whirr/myhadoopcluster/hadoop-site.xml
You need to update your local Hadoop configuration to use this file. Type the following commands:
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.whirr
rm -f /etc/hadoop-0.20/conf.whirr/*-site.xml
cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop-0.20/conf.whirr
alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.whirr 50
alternatives --display hadoop-0.20-conf
A proxy script was created for you at:
You should now start that proxy. It is there for security reasons. All traffic from the network where your client is running is proxied through the master node of the cluster using an SSH tunnel. This script launches the proxy. Run the following command to launch the script:
If that doesn’t run, make sure you have the right permissions on the file:
chmod +rwx hadoop-proxy.sh
Then try again.
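Here is a minimal sketch of this step, assuming the proxy script sits next to the site file in ~/.whirr/myhadoopcluster/ (an assumption inferred from the cp command earlier; adjust the path if Whirr reports a different location). The helper name start_proxy is hypothetical.

```shell
# Make the Whirr-generated proxy script executable and run it in the
# background, keeping its output in a log file. Prints the process id.
start_proxy() {
    proxy_script="$1"
    log_file="$2"
    if [ ! -f "$proxy_script" ]; then
        echo "Proxy script not found: $proxy_script" >&2
        return 1
    fi
    chmod u+x "$proxy_script"
    # Run in the background so the terminal stays free for hadoop commands.
    "$proxy_script" > "$log_file" 2>&1 &
    echo "$!"
}

# Example (assumed location):
# proxy_pid=$(start_proxy "$HOME/.whirr/myhadoopcluster/hadoop-proxy.sh" /tmp/hadoop-proxy.log)
# kill "$proxy_pid"   # stop the proxy when you are done
```

If hadoop commands hang or fail to connect, the log file is the first place to look for SSH errors.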
With the above you are now able to use your Hadoop Cluster.
Prove that by browsing HDFS:
hadoop fs -ls /
Now it is time to run a MapReduce job! We are going to use one of the example programs provided in the Hadoop installation. The programs are in the file hadoop-examples-*.jar. First, let’s review the list of options available from the program. See these by entering:
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
You will see:
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
So let’s put this info to use. We will make a directory, put some data in it, and run the wordcount program:
$ export HADOOP_HOME=/usr/lib/hadoop
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount input output
$ hadoop fs -cat output/part-* | head
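To get a feel for what the wordcount job computes, here is a local approximation in shell: split the text on whitespace, bring identical words together, and count each group. This is only a sketch of the idea on a single machine, not how Hadoop implements it, and the function name local_wordcount is ours.

```shell
# Print "word<TAB>count" for every whitespace-separated word in a file,
# sorted by word, similar in shape to the wordcount job's output.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" |   # "map": one word per line
        grep -v '^$' |                # drop any empty line from leading whitespace
        LC_ALL=C sort |               # "shuffle": bring identical words together
        uniq -c |                     # "reduce": count each group
        awk '{ print $2 "\t" $1 }'    # reorder to word, then count
}

# Example: local_wordcount "$HADOOP_HOME/CHANGES.txt" | head
```

On a small file you can compare this against the cluster’s output; the point of running it on Hadoop is that the same counting scales across many nodes when the input no longer fits on one machine.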
Now you are off and running.
You now have a platform capable of scaling to very large jobs. And it runs CDH3, the most reliable, capable distribution of Hadoop and related technologies. Let the fun begin!
But one final note. Think about the lifecycle of your system. At some time you will need to spin it down and turn it off. To destroy the cluster gracefully using Whirr, enter this command:
whirr destroy-cluster --config hadoop.properties
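Because a forgotten cluster keeps costing money, it can help to tie launch and teardown together in a script. The following is a minimal sketch, assuming whirr is on your PATH and using a hypothetical wrapper name run_on_cluster. It relies only on the launch-cluster and destroy-cluster commands shown in this guide and leaves out the configuration and proxy steps, which you would add for real use.

```shell
# Launch a cluster, run one command, then always destroy the cluster.
# Usage: run_on_cluster <properties-file> <command> [args...]
run_on_cluster() {
    config="$1"
    shift
    whirr launch-cluster --config "$config" || return 1
    # Run the caller's command, remembering whether it succeeded.
    job_status=0
    "$@" || job_status=$?
    # Tear the cluster down even if the job failed.
    whirr destroy-cluster --config "$config"
    return "$job_status"
}

# Example: run_on_cluster hadoop.properties hadoop fs -ls /
```

If the launch itself fails partway through, some servers may already exist, so it is still worth checking your Rackspace control panel afterwards.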
Using the information above you can create and better manage Hadoop Clusters on Rackspace very easily. This is how we create our CDH Clusters. In future posts we will show you how to get data prepared for analysis and how to run some queries. We will also provide tips on how to use Cloudera’s free management tools and how to upgrade to Cloudera Enterprise when you are ready.