Part 2: An Integrated Approach to Load Test Analysis

The Follow-up Test

In Part 1, I demonstrated how to add more depth to the analysis of a Compuware APM Web Load Test by combining the external load results with the application and infrastructure data collected by the Compuware PureStack Technology. But now that we have tested the system once, what would happen if we tested it again after we identified and "resolved" the issues we found? Would running a test using the same parameters as in the initial test show a clear performance improvement? Would the system be able to achieve the desired load of 200 virtual users with little or no performance degradation?

This article takes you through the steps you should follow in order to directly compare the results of two load tests and measure the performance improvement (or degradation) that occurred with the fixes put in place.

Step 1: Identify issues and implement changes based on initial results
During the April 14 load test session, Andreas Grabner and I found substantial performance concerns for the application under a load well in excess of what is currently seen even on the APM Community's busiest days. The problem was that the load that triggered these issues was still well short of the 200 virtual users (VUs) the application team wanted to reach.

An Aggregated Data View Showing External and Internal Performance Indicators During the April 14 Load Test

During the April 14 load test execution, a number of environment issues were identified. The critical ones addressed by the systems team included:

  • Deploying critical APM Community applications to different machines so that the performance of one layer does not negatively affect another
  • Optimizing the way APM Community pages are built in the application layer to reduce CPU usage
  • Optimizing cache settings in Confluence to reduce roundtrips to the database when loading commonly used objects
  • Increasing the CPU power of the virtualized machines so that they can handle more load

Step 2: Re-Run the test (With the same parameters!)
Once these steps were complete, a second test cycle was scheduled to determine if the updated environment would be able to reach the desired 200 VU target without encountering response time degradation. The follow-up load test was executed exactly one week later, on April 21, and used the same parameters as the initial test (see previous post for load ramping details). Using the same test parameters (load ramp, test scripts, testing locations, databanks, etc.) is critical in order to allow a like-for-like comparison to occur. Any deviation in the test configuration can skew the results and potentially lead to an unintended sense of confidence (or fear of implosion) regarding the application environment.
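For teams automating this kind of comparison, a minimal sketch of how the test parameters might be captured and verified before a follow-up run is shown below (in Python). The parameter names and values are illustrative assumptions, not the actual Compuware Web Load Test configuration.

```python
# Illustrative sketch: capture the load-test parameters so a follow-up run can
# be verified as a true like-for-like comparison. All names and values here are
# assumptions for illustration, not the real test configuration.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LoadTestConfig:
    target_vus: int        # peak virtual users the ramp aims for
    ramp_minutes: int      # minutes taken to reach the peak load
    scripts: tuple         # scripted synthetic transactions
    locations: tuple       # external load-generation locations
    databanks: tuple       # data files driving the scripts

APRIL_14 = LoadTestConfig(
    target_vus=200,
    ramp_minutes=110,                                    # assumed ramp length
    scripts=("community_home", "search", "forum_post"),  # hypothetical names
    locations=("us-east", "eu-west"),                    # hypothetical names
    databanks=("users.csv",),
)

# The follow-up test reuses the exact same configuration; any difference is
# flagged before the run starts so the comparison cannot be skewed.
APRIL_21 = replace(APRIL_14)
assert APRIL_14 == APRIL_21, "Test parameters differ - results would not be comparable"
```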

When the April 21 round of load testing was complete and we began to analyze the results, the initial data (higher throughput, faster response times, lower CPU utilization, and a reduction in database load) suggested that this load test was substantially more successful than the previous test execution. This initial conclusion was based on performance charts containing the same metrics we used to analyze the April 14 test, which provided a direct comparison of critical measures and showed whether the pattern of performance had changed dramatically between the two test executions.

Step 3: Compare the Results
So, to start the comparative analysis, we took three key metrics of the April 14 and April 21 results and charted them together: External Web Load Test Average Response Time; Web Load Test Transactions per Minute; and percentage of CPU Utilization on the web server. Using just these three comparisons, it is clear that the two load tests had very different performance profiles.

Starting with the Web Load Test Average Response Times (the time required to completely download all of the content in the scripted synthetic transactions used in the load test), it is very clear that after 08:50 EDT - 40 minutes into both tests - the response times diverged and remained on different paths for the remainder of the comparative test run. From this point on, the April 21 load test averaged load times that were around 50% faster than the April 14 test (note: the Moving Average of percentage change averages 5 minutes of response-time change to produce a clearer trend line). It took nearly 20 more minutes for Average Transaction Response Time to reach 20 seconds on April 21, even with load being applied at the same volume as in the April 14 test.

Comparison of Response Times showing improvement in April 21 APM Community Load Test resulting in lower average response times
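As a rough illustration of the "Moving Average of percentage change" mentioned above, the sketch below (using pandas, an assumption on my part; any time-series tooling would do) computes the per-minute percentage difference between the two tests' response times and smooths it over a 5-minute window. The sample values are made up for the example, not taken from the actual test data.

```python
import pandas as pd

def pct_change_trend(apr14: pd.Series, apr21: pd.Series) -> pd.Series:
    """Both inputs: average response times (seconds) indexed by minutes into the test."""
    pct_change = (apr21 - apr14) / apr14 * 100.0                # negative = April 21 is faster
    return pct_change.rolling(window=5, min_periods=1).mean()   # 5-minute smoothing

# Made-up values purely for illustration.
minutes = range(10)
apr14 = pd.Series([4.0, 4.2, 4.5, 5.0, 6.0, 8.0, 10.0, 12.0, 15.0, 18.0], index=minutes)
apr21 = pd.Series([4.0, 4.1, 4.3, 4.5, 4.8, 5.2, 5.8, 6.5, 7.5, 9.0], index=minutes)
print(pct_change_trend(apr14, apr21).round(1))
```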

The Web Load Transactions per Minute (the number of WLT transactions executed in a minute at that point of the load test) showed a pattern where the April 21 test also diverged from the April 14 test at 08:50 EDT. With the faster WLT Average Response Times, the April 21 test saw the system process 40-50% more transactions per minute than the April 14 test from 08:50 EDT until the end of the test cycle.

Comparison of Transactions per Minute showing improvement in April 21 APM Community Load Test resulting in higher transactions per minute

Much of this improvement can be traced to the third metric: CPU utilization on the web server (the percentage of CPU used by the system and applications for performing all necessary activities on the machine). Throughout the April 21 test, with more hardware and optimized page-rendering processes helping out, the web server CPU was less heavily stressed, reaching 100% utilization much later than in the April 14 test.

Comparison of CPU Utilization showing improvement in April 21 APM Community Load Test resulting in lower CPU utilization up until 09:40 EDT

These three metrics are directly tied to the Number of Web Requests per Minute recorded at the Confluence application layer for the April 21 test. This metric peaked at 125-140 per minute during the April 21 test, compared to a peak of approximately 100 Web Requests per minute in the April 14 test.

Despite the seeming success of the second load test on April 21, there were still issues that appeared. Building an integrated results chart for the April 21 load test shows that multiple performance events occurred once the load test reached the 100% CPU Utilization boundary (red vertical line in chart below). This appears to indicate that despite the improvements to the environment discussed above, there is still a CPU bottleneck present at higher loads.

An area of extreme contrast between the two tests was recorded in the Database Results. Database stats were clearly visible in the data from the April 14 test (see the aggregated performance metric chart in Step 1), including a large spike in the number and length of queries just before the application reached the CPU bottleneck. But in order to find the same metrics in the April 21 test, you have to break out your microscope and look very closely at the bottom of the chart.

External and Internal Performance Metrics for April 21 Load Test

The reduction in database load was the direct result of the optimized cache settings enabled after the April 14 load test. With more of the data being stored in the application cache, the number of calls to the database decreased, removing this layer as a potential bottleneck at this load volume.
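To illustrate the mechanism (not Confluence's actual cache implementation), here is a minimal Python sketch of how caching commonly used objects takes the database out of the hot path; the function names and query are hypothetical.

```python
from functools import lru_cache

def query_database(sql: str, *params):
    # Placeholder standing in for a real database roundtrip.
    return {"id": params[0], "name": "example"}

@lru_cache(maxsize=10_000)
def get_user_profile(user_id: int) -> dict:
    # Cache miss: one database roundtrip. Repeat calls for the same user are
    # served from memory, so the database is no longer hit for common objects.
    return query_database("SELECT * FROM user_profiles WHERE id = %s", user_id)

get_user_profile(42)                  # miss - goes to the database
get_user_profile(42)                  # hit  - served from the cache
print(get_user_profile.cache_info())  # hits=1, misses=1
```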

Step 4: Results and Next Steps
The lack of a sudden spike in Confluence/Atlassian processing time in the April 21 test (along with the accompanying database spike) was due to the removal of an application layer process that had been scheduled to run during the load test period. This process, and its effects on the systems and user experience, was quickly recognized once Andreas reviewed his data. Once the job that caused this issue was identified, it was removed in time for the April 21 test, completely eliminating a performance bottleneck that was encountered early in the April 14 test.

Lesson learned: Don't schedule system-intensive jobs to run during peak traffic periods; find a window with the lowest traffic volume to perform these tasks so that the fewest visitors possible are affected.

As we noted at the start of this post, it appears on the surface that the April 21 load test was more successful than the April 14 test. Yet, despite the improved performance, the results show that there are still performance concerns that need to be addressed. These concerns center around a dramatic spike in response times between 09:40 and 09:50 EDT, occurring after the load test had been running for 90 minutes.

When the system began to show degraded performance, it could easily be tracked using the three key metrics: WLT Average Response Time, WLT Transactions per Minute, and CPU Utilization. Running transactions began to take much longer to execute, which decreased both the number of incoming web requests to the application layer and the number of transactions per minute executed by the load-generation system. The root cause can be seen in the chart below, which removes some of the data series for clarity.

At 167 VUs, the system redlines, and has a sudden, 10-minute degradation of performance, after which it recovers when the test stabilizes at 200 VUs

The period of degradation that was detected during the load test started at 09:40 EDT and coincided with the following:

  • Web Load Test achieving 167 VUs
  • CPU on the web server measuring 100%
  • Web Load Test Transactions per Minute averaging 130
  • Confluence Web Requests (the application layer of the APM Community Portal) measured at 135 per minute

Interestingly, after 10 minutes, this issue cleared up completely, except for transaction response times. The response times did not return to pre-spike values, but were now averaging almost 20 seconds higher than before the spike. With the system now peaked at 200 VUs and no additional load being generated, the other metrics returned immediately to their pre-spike levels - notably Transactions per Minute and Web Requests per Minute. So, with 33 more VUs than before the spike, the system again appeared to be directly affected by a CPU bottleneck, as the higher load could not increase the number of requests processed at the application layer.
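For readers who want to flag this kind of window automatically rather than by eyeballing the charts, here is a minimal sketch of how the degradation period could be detected from the three key metrics. The thresholds and column names are assumptions for illustration, not values derived from the actual test.

```python
import pandas as pd

def degradation_windows(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per minute, indexed by timestamp, with columns
    'avg_response_s', 'transactions_per_min', and 'cpu_pct'."""
    baseline_rt = df["avg_response_s"].rolling("15min").median()  # recent norm
    degraded = (
        (df["cpu_pct"] >= 99)                               # CPU saturated
        & (df["avg_response_s"] > 2 * baseline_rt)          # response times spike
        & (df["transactions_per_min"]                       # throughput lower than
           < df["transactions_per_min"].shift(5))           # five minutes earlier
    )
    return df[degraded]   # the rows making up the degradation window
```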

Out of this sea of metrics, we determined that the April 21 load test showed a clear comparative improvement over the April 14 test, but the second test was still unable to reach the target of 200 VUs without suffering a bottleneck that caused performance to degrade dramatically.

Analyzing the degradation
To find the cause of the CPU bottleneck that prevented the April 21 test from reaching the goal of 200 VUs with little or no performance degradation, we have to dig deeper into the server-side metrics, especially those related to the health of the application server. The dip in transactions throughout the system aligns with the issue captured when the system hit 167 VUs. The question is: Was the dip in transactions processed and the rise in transaction response times the result of this load volume, or a symptom of the actual cause of the performance degradation?

When the system degrades, the server-side data shows that high garbage collection (GC) activity could be a problem, as this automated process ran at the same time as the degradation. It is clear that executing a very CPU-intensive system process when the web server CPU is already exhausted can cause a very large performance degradation.

Increased GC is normal while increasing the load - but there is an unusual spike exactly when we see a dip in transaction throughput
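A quick way to check whether the GC spike and the throughput dip really line up, rather than merely looking similar on separate charts, is to put both series on one timeline and correlate them. The sketch below assumes per-timestamp series exported from the monitoring tool; the column names and resampling interval are illustrative.

```python
import pandas as pd

def gc_throughput_correlation(gc_time_ms: pd.Series, tpm: pd.Series) -> float:
    """Both series are indexed by timestamp. Resample to one-minute buckets and
    correlate; a strongly negative value means minutes with heavy garbage
    collection are also minutes with low transaction throughput."""
    frame = pd.DataFrame({
        "gc_ms": gc_time_ms.resample("1min").sum(),   # total GC time per minute
        "tpm": tpm.resample("1min").mean(),           # transactions per minute
    }).dropna()
    return frame["gc_ms"].corr(frame["tpm"])
```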

Looking at the application-server-specific transaction response times, it is easy to spot the potential problem. The following charts show that "yet another" background job executes every hour, taking CPU cycles away from the already exhausted system.

A background job executes every hour taking up to 300s in CPU cycles at the time when virtual users experienced the performance degradation

Looking at these transactions reveals that the job is an hourly update job that synchronizes the cached user objects with the user directory database. This takes a considerable amount of time because we have 65k+ users on the APM Community system. This update job causes a lot of objects to be created and destroyed - hence the increased memory and GC activity.

The synchronization job is the root cause of the degradation by consuming a lot of CPU as well as allocating memory which causes high GC activity

As with the April 14 load test, the April 21 load test exposed issues with the system that prevented the achievement of the 200 VU goal. But now, we have a clear culprit for the prevention of this goal, so efforts can focus on reducing or eliminating the effect that this update process has on the system when it is under peak load.
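As a purely hypothetical illustration of the mechanism (and one possible direction for reducing its impact), the sketch below contrasts a full hourly resynchronization of all 65k+ cached user objects with a delta-based approach that only touches changed users. None of these names correspond to the actual Confluence job; the mitigation is a suggestion, not what the team implemented.

```python
def full_sync(cache: dict, user_directory) -> None:
    # Rebuilds an object for every one of the 65k+ users each hour, so a large
    # number of short-lived objects are allocated and then garbage collected.
    for user_id in user_directory.all_user_ids():
        cache[user_id] = user_directory.load_user(user_id)

def delta_sync(cache: dict, user_directory, since) -> None:
    # Only touches users that changed since the last run, so allocation (and
    # therefore GC pressure) scales with the change rate, not the user count.
    for user_id in user_directory.changed_user_ids(since):
        cache[user_id] = user_directory.load_user(user_id)
```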

Conclusion
In both tests, regardless of how you measure the "success" of a load test, something was learned about the system by aggregating metrics from inside and outside the infrastructure being tested. We now know that the optimizations performed after the April 14 load test allowed the system to process 40-50% more transactions per minute up to 167 VUs, the point at which a scheduled system process caused a severe application degradation.

This data could only be turned into actionable information because we had a process in place that allowed results captured from inside the firewall to be easily aligned with the external results from the load test system. By doing this, the customer, albeit in a very controlled form, becomes a factor in the analysis of system performance.
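The alignment process itself can be as simple as joining the two data sources on a shared timeline. The sketch below (again pandas, with illustrative column assumptions) resamples both the external load-test results and the internal server-side metrics to one-minute buckets and joins them into a single frame for charting.

```python
import pandas as pd

def aligned_view(external: pd.DataFrame, internal: pd.DataFrame) -> pd.DataFrame:
    """external: response times and transactions per minute from the load-test
    service; internal: CPU, web requests, database, and GC metrics collected
    inside the firewall. Both are indexed by timestamp and use distinct column
    names so they can be joined side by side."""
    ext = external.resample("1min").mean()
    inte = internal.resample("1min").mean()
    # An outer join keeps every minute of the test, even where one source has a
    # gap, so divergences between the external and internal views stay visible.
    return ext.join(inte, how="outer")
```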

By creating a full performance perspective, PureStack delivers more than just deeper technical metrics on a system under load. PureStack places the experience of the visitor at the same level of importance as CPU, database, and web requests processed by the application layer when the results are analyzed. The importance of the user experience then dictates how infrastructure issues are prioritized and resolved, as the effect these issues have on end users provides real-world feedback into the true cost of performance issues that occur to your application during peak periods.

Using the data from this load test, we realized that additional changes to the system were needed, especially in the area of page rendering, in order to further reduce CPU load and allow the system to reach and maintain a peak load of 200 virtual users. With the upgrade to the Confluence application software - deployed in early July 2013 - it was expected that the desired goal would be reached. But in case this is not sufficient, an additional load test on the new Confluence system is expected to occur in July 2013, once the system has completely stabilized. Using the same transaction paths as in the April 14 and 21 load tests, the system will be verified to confirm that the upgrade is delivering the expected performance.

More Stories By Stephen Pierzchala

With more than a decade in the web performance industry, Stephen Pierzchala has advised many organizations, from Fortune 500 companies to startups, on how to improve the performance of their web applications. He helps them develop and evolve the unique speed, conversion, and customer experience metrics necessary to effectively measure, manage, and evolve online web and mobile applications that improve performance and increase revenue. Working on projects for top companies in the online retail, financial services, content delivery, ad-delivery, and enterprise software industries, he has developed new approaches to web performance data analysis. Stephen has led web performance methodology, CDN assessment, SaaS load testing, technical troubleshooting, and performance assessments, demonstrating the value of web performance. He is noted for his technical analyses and knowledge of web performance from the outside-in.
