By Michael Kopp
November 4, 2012 09:00 AM EST
An eCommerce site that crashes seven times during the Christmas season, staying down for up to five hours each time, loses a lot of money and reputation. This happened to one of our customers, who told the story at our annual performance conference earlier this month. Of the several causes behind these crashes, I want to share more detail on one that I often see on other websites as well: load balancers set to round-robin instead of least-busy can easily lead to app server crashes caused by heap memory exhaustion. Let's dig into how to identify these problems and how to avoid them.
The Symptom: Crashing Tomcat Instances
The website is deployed on six Tomcats with three front-end Apache Web Servers. During peak load hours, individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed with out-of-memory exceptions, which also brought down the rest of the site because the remaining servers could no longer handle the load. Figure 1 shows the actual flow of transactions through the system, highlighting unevenly distributed response time in the application servers and functional errors being reported on all tiers (red-colored server icon):
Even with equally distributed load (Round Robin Load Balancer Setting) one of the Tomcats spiked in Response Time Contribution before crashing
Once the App Server started rejecting incoming connections, we could observe the first ripple effect of errors: a very high number of exceptions in the database layer and exceptions thrown between application tiers, with the web app responding with HTTP 500s:
Within 30 minutes the application serves 43000 pages with an HTTP 500 Response correlating to Exceptions in the Database and Inter-Tier Communication
The Root Cause: Inefficient Database Statements and Connection Pool Usage
The exceptions caught in the Database Layer (JDBC) were already a very good hint of the root cause of this problem. A closer look at the Exceptions shows that connection pools are exhausted, which causes problems in the different components of the application:
Exhausted Connection Pool causes Exceptions that impact Data Access Layer as well as Widget Rendering
Looking at the performance breakdown by application layer reveals how much performance impact connection pooling has on the overall transaction response time:
Due to the connection pool problem a single request had to wait 3.8s on average to obtain a connection from the pool
The size of the pool was not the only problem, however. Several very inefficient database statements took a long time to execute for certain business transactions of the application, which caused the application server to hold on to connections longer than normal. Because the load balancer was configured for round-robin, the affected app server kept receiving new requests regardless of how busy it was. Eventually, just by the random nature of incoming requests, one app server received several of the requests that executed these inefficient database calls. Once the connection pool was exhausted, the application started throwing exceptions that ultimately led to a crash of the JVM. Once the first app server crashed, it didn't take long for the other app servers to go down as well.
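To illustrate how long-running statements translate into waiting time at the pool, here is a minimal sketch of a fixed-size connection pool served first come, first served. It is not the customer's code: the function name `average_wait_ms`, the pool size of 5, and the arrival and hold times are all hypothetical values chosen for illustration.

```python
import heapq


def average_wait_ms(arrivals_ms, hold_ms, pool_size):
    """Average time (ms) a request waits for a connection from a fixed-size pool.

    Requests are served first come, first served; each one holds its
    connection for hold_ms once it gets one.
    """
    # Each heap entry is the time at which one pooled connection is free again.
    free_at = [0] * pool_size
    heapq.heapify(free_at)
    total_wait = 0
    for arrival in arrivals_ms:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait if no connection is free yet
        total_wait += start - arrival
        heapq.heappush(free_at, start + hold_ms)
    return total_wait / len(arrivals_ms)


# Hypothetical workload: one request every 100 ms for 10 seconds, pool of 5.
arrivals = list(range(0, 10_000, 100))
fast = average_wait_ms(arrivals, hold_ms=200, pool_size=5)
slow = average_wait_ms(arrivals, hold_ms=1_000, pool_size=5)
print(f"fast statements: {fast:.0f} ms average wait")
print(f"slow statements: {slow:.0f} ms average wait")
```

With the same pool and the same arrival rate, the only thing that changes between the two runs is how long each request holds its connection. Once the statements hold connections longer than the pool can absorb, every new request queues behind the previous ones and the wait keeps growing.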
The Solution: Optimizing App and Load Balancer
The problem was fixed by looking at the slowest database statements and optimizing them for performance, for example by adding indices on the database or making the SQL statements more efficient. They also adjusted the pool size to accommodate the expected load during peak hours.
They started by optimizing SQL Statements that took long to execute and those that got executed several times within the same transaction
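The two selection criteria in the caption above (statements that are slow, and statements executed many times within one transaction) can be sketched as a simple aggregation over recorded executions. This is only an illustration: the record layout `(transaction_id, sql, duration_ms)`, the function name `find_hotspots`, and both thresholds are assumptions, not the output format of any particular APM tool.

```python
from collections import Counter, defaultdict


def find_hotspots(executions, slow_ms=500, max_repeats=3):
    """Return (slow, repeated) lists of SQL statements.

    executions: iterable of (transaction_id, sql, duration_ms) tuples.
    slow:       statements whose average duration is at least slow_ms.
    repeated:   statements run more than max_repeats times in one transaction.
    """
    total_ms = defaultdict(int)
    calls = Counter()
    calls_per_txn = Counter()
    for txn_id, sql, duration_ms in executions:
        total_ms[sql] += duration_ms
        calls[sql] += 1
        calls_per_txn[(txn_id, sql)] += 1
    slow = sorted(s for s in calls if total_ms[s] / calls[s] >= slow_ms)
    repeated = sorted({s for (_, s), n in calls_per_txn.items() if n > max_repeats})
    return slow, repeated


# Hypothetical sample: one slow report query and one lookup repeated per item.
sample = [("t1", "SELECT * FROM orders WHERE status = ?", 900)]
sample += [("t1", "SELECT price FROM items WHERE id = ?", 5)] * 4
sample += [("t2", "SELECT name FROM users WHERE id = ?", 8)]

slow, repeated = find_hotspots(sample)
print("slow:", slow)
print("repeated:", repeated)
```

Statements in the first list are candidates for indices or rewrites; statements in the second list are candidates for batching or caching, since each call is cheap but together they keep the connection busy.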
They also changed the load balancer setting from Round-Robin to Least-Busy, which was the LB vendor's preferred setting but which they had simply forgotten to configure in the production environment.
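The difference between the two settings can be sketched with a small simulation. It is not the vendor's algorithm: the function `peak_concurrency`, the policy names, and the workload are hypothetical, and the workload is deliberately contrived so that every slow request lands on the same server under round-robin (in the real incident this happened by chance).

```python
import heapq
from itertools import cycle


def peak_concurrency(requests, server_count, policy):
    """Highest number of in-flight requests any single server reaches.

    requests: list of (arrival, duration) tuples sorted by arrival time.
    policy:   "round_robin" or "least_busy".
    """
    active = [0] * server_count
    finishing = []  # min-heap of (finish_time, server)
    next_in_rotation = cycle(range(server_count))
    peak = 0
    for arrival, duration in requests:
        # Release every request that has completed by this arrival time.
        while finishing and finishing[0][0] <= arrival:
            _, done_on = heapq.heappop(finishing)
            active[done_on] -= 1
        if policy == "round_robin":
            server = next(next_in_rotation)  # ignores current load
        elif policy == "least_busy":
            server = min(range(server_count), key=lambda s: active[s])
        else:
            raise ValueError(f"unknown policy: {policy}")
        active[server] += 1
        peak = max(peak, active[server])
        heapq.heappush(finishing, (arrival + duration, server))
    return peak


# Contrived workload: one request per tick, every sixth one is slow.
workload = [(t, 50 if t % 6 == 0 else 1) for t in range(60)]
rr_peak = peak_concurrency(workload, server_count=6, policy="round_robin")
lb_peak = peak_concurrency(workload, server_count=6, policy="least_busy")
print("round-robin peak per server:", rr_peak)
print("least-busy peak per server:", lb_peak)
```

Round-robin keeps handing requests to a server that is already holding many slow ones, while least-busy steers new requests toward servers with fewer requests in flight, so no single server accumulates the whole backlog.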
The Result: Site Has Not Been Down Since
Since they made the changes to the application and the load balancer, the site has not gone down. The next holiday season is coming up and they are ready for the next seasonal spikes. Even though they are confident that everything will work without any problems, they learned their lesson and are approaching performance proactively through proper load testing.
Next Steps: Proactive Performance Management
The lesson learned was that these problems could have been found before the holiday shopping season through proper load testing. They had done load testing before but never encountered this problem, for two reasons: 1) they didn't test with expected peak volumes over long enough sessions, and 2) they didn't use a tool that simulated the variations in real customer behavior on their highly interactive website (they had too few scripts, and the scripts were too simple).
Their strategy for proactive performance management is that they:
- Perform load tests on the production system during low traffic hours (2 a.m.-6 a.m.), accepting the risk of minor sales losses in case of a crash, versus major sales losses during the holiday shopping season.
- Multiply the hourly load test volume by 2.5 since their actual peaks are 10 hours long.
- Use a load testing service that uses real browsers in different locations around the U.S.
- Use an APM solution that identifies problems within the application while running the load test.
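The 2.5 multiplier in the list above is not derived in the text. One reading, which is an assumption on my part, is that the volume of a 10-hour peak is compressed into the 4-hour test window (2 a.m. to 6 a.m.), giving 10 / 4 = 2.5. Under that assumption the arithmetic looks like this (the function name and the sample volume are hypothetical):

```python
def compressed_hourly_volume(peak_hourly_requests, peak_hours=10, window_hours=4):
    """Hourly volume needed to replay a whole peak inside a shorter test window.

    Assumes the multiplier is simply peak_hours / window_hours.
    """
    return peak_hourly_requests * peak_hours / window_hours


# Hypothetical example: 1,000 requests per hour during the real peak.
print(compressed_hourly_volume(1_000))
```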
If you want to read more on common performance problems that are not found prior to moving to production, check out my recent series of blogs: Supersized Content, Deployment Mistakes, or Excessive Logging.