Machine Learning Authors: Pat Romanski, William Schmarzo, Yeshim Deniz, Liz McMillan, Jason Bloomberg

Related Topics: Java IoT, Microservices Expo, Microsoft Cloud, Machine Learning , Agile Computing, @DXWorldExpo

Java IoT: Article

Accurately Identify Impact of System Issues on End-User Response Time

Ops and Apps teams now easily prioritize both infrastructure and app fixes

Triggered by current expected load projections for our community portal, our Apps Team was tasked to run a stress on our production system to verify whether we can handle 10 times the load we currently experience on our existing infrastructure. In order to have the least impact in the event the site crumbled under the load, we decided to run the first test on a Sunday afternoon. Before we ran the test we gave our Operations Team a heads-up: they could expect significant load during a two-hour window with the potential to affect other applications that also run on the same environment.

During the test, with both the Ops and Application Teams watching the live performance data, we all saw end-user response time go through the roof and the underlying infrastructure running out of resources when we hit a certain load level. What was very interesting in this exercise is that both the Application and Ops teams looked at the same data but examined the results from a different angle. However, they both relied on the recently announced Compuware PureStack Technology, the first solution that - in combination with dynaTrace PurePath - exposes how IT infrastructure impacts the performance of critical business applications in heavy production environments.

Bridging the Gap between Ops and Apps Data by adding Context: One picture that shows the Hotspots of "Horizontal" Transaction as well as the "Vertical" Stack.

The root cause of the poor performance in our scenario was CPU exhaustion - on a main server machine hosting both the Web and App Server - caused us not to meet our load goal. This turned out to be both an IT provisioning and an application problem. Let me explain the steps these teams took and how they came up with their list of action items in order to improve the current system performance in order to do better in the second scheduled test.

Step 1: Monitor and Identify Infrastructure Health Issues
Operations Teams like having the ability to look at their list of servers and quickly see that all critical indicators (CPU, Memory, Network, Disk, etc.) are green. But when they looked at the server landscape when our load test reached its peak, their dashboard showed them that two of their machines were having problems:

The core server for our community portal shows problems with the CPU and is impacting one of the applications that run on it.

Step 2: What is the actual impact on the hosted applications?
Clicking on the Impacted Applications Tab shows us the applications that run on the affected machine and which ones are currently impacted:

The increased load not only impacts the Community Portal but also our Support Portal

Already the load test has taught us something: As we expect higher load on the community in the future, we might need to move the support portal to a different machine to avoid any impact.

When examined independently, operations-oriented monitoring would not be that telling. But when it is placed in a context that relates it to data (end user response time, user experience, ...) important to the Applications team, both teams gain more insight.  This is a good start, but there is still more to learn.

Step 3: What is the actual impact on the critical transactions?
Clicking on the Community Portal application link shows us the transactions and pages that are actually impacted by the infrastructure issue, but there still are two critical unanswered questions:

  • Are these the transactions that are critical to our successful operation?
  • How badly are these transactions and individual users impacted by the performance issues?

The automatic baseline tells us that our response time for our main community pages shows significant performance impact. This also includes our homepage which is the most valuable page for us.

Step 4: Visualizing the impact of the infrastructure issue on the transaction
The transaction-flow diagram is a great way to get both the Ops and App Teams on the same page and view data in its full context, showing the application tiers involved, the physical and virtual machines they are running on, and where the hotspots are.

The Ops and Apps Teams have one picture that tells them where the Hotspots both in the "Horizontal" Transaction as well as the "Vertical" Stack is.

We knew that our pages are very heavy on content (Images, JavaScript and CSS), with up to 80% of the transaction time spent in the browser. Seeing that this performance hotspot is now down to 50% in relation to the overall page load time we immediately know that more of the transaction time has shifted to the new hotspot: the server side. The good news is that there is no problem with the database (only shows 1% response time contribution) as this entire performance hotspot shift seems to be related to the Web and App Servers, both of which run on the same machine - the one that has these CPU Health Issues.

Step 5: Pinpointing host health issue on the problematic machine
Drilling to the Host Health Dashboard shows what is wrong on that particular server:

The Ops Team immediately sees that the CPU consumption is mainly coming from one Java App Server. There are also some unusual spikes in Network, Disk and Page Faults that all correlated by time.

Step 6: Process Health dashboards show slow app server response
We see that the two main processes on that machine are IIS (Web Server) and Tomcat (Application Server). A closer look shows how they are doing over time:

We are not running out of worker threads. Transfer Rate is rather flat. This tells us that the Web Server is waiting on the response from the Application Server.

It appears that the Application Server is maxing out on CPU. The incoming requests from the load testing tool queue up as the server can't process them in time. The number of processed transactions actually drops.

Step 7: Pinpointing heavy CPU usage
Our Apps Team is now interested in figuring out what consumes all this CPU and whether this is something we can fix in the application code or whether we need more CPU power:

The Hotspot shows two layers of the Application that are heavy on CPU. Lets drill down further.

Our sometimes rather complex pages with lots of Confluence macros cause the major CPU Usage.

Exceptions that capture stack trace information for logging are caused by missing resources and problems with authentication.

Ops and Apps teams now easily prioritize both infrastructure and app fixes
As mentioned, ‘context is everything.' But it's not simply enough to have data - context relies on the ability to intelligently correlate all of the data into a coherent story. When the "horizontal" transactional data for end-user response-time analysis is connected to the "vertical" infrastructure stack information, it becomes easy to get both teams to read from the same page and prioritize fixes that have the greatest negative impact on the business.

This exercise allowed us to identify several action items:

  • Deploy our critical applications on different machines when the applications impact each other negatively
  • Optimize the way our pages are built to reduce CPU usage
  • Increase CPU power on these virtualized machines to handle more load

More Stories By Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.

@CloudExpo Stories
Digital Transformation: Preparing Cloud & IoT Security for the Age of Artificial Intelligence. As automation and artificial intelligence (AI) power solution development and delivery, many businesses need to build backend cloud capabilities. Well-poised organizations, marketing smart devices with AI and BlockChain capabilities prepare to refine compliance and regulatory capabilities in 2018. Volumes of health, financial, technical and privacy data, along with tightening compliance requirements by...
"NetApp is known as a data management leader but we do a lot more than just data management on-prem with the data centers of our customers. We're also big in the hybrid cloud," explained Wes Talbert, Principal Architect at NetApp, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life settlement products to hedge funds and investment banks. After, he co-founded a revenue cycle management company where he learned about Bitcoin and eventually Ethereal. Andrew's role at ConsenSys Enterprise is a mul...
Evan Kirstel is an internationally recognized thought leader and social media influencer in IoT (#1 in 2017), Cloud, Data Security (2016), Health Tech (#9 in 2017), Digital Health (#6 in 2016), B2B Marketing (#5 in 2015), AI, Smart Home, Digital (2017), IIoT (#1 in 2017) and Telecom/Wireless/5G. His connections are a "Who's Who" in these technologies, He is in the top 10 most mentioned/re-tweeted by CMOs and CIOs (2016) and have been recently named 5th most influential B2B marketeer in the US. H...
The best way to leverage your Cloud Expo presence as a sponsor and exhibitor is to plan your news announcements around our events. The press covering Cloud Expo and @ThingsExpo will have access to these releases and will amplify your news announcements. More than two dozen Cloud companies either set deals at our shows or have announced their mergers and acquisitions at Cloud Expo. Product announcements during our show provide your company with the most reach through our targeted audiences.
DevOpsSummit New York 2018, colocated with CloudEXPO | DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City. Digital Transformation (DX) is a major focus with the introduction of DXWorldEXPO within the program. Successful transformation requires a laser focus on being data-driven and on using all the tools available that enable transformation if they plan to survive over the long term. A total of 88% of Fortune 500 companies from a generation ago are now out of bus...
With 10 simultaneous tracks, keynotes, general sessions and targeted breakout classes, @CloudEXPO and DXWorldEXPO are two of the most important technology events of the year. Since its launch over eight years ago, @CloudEXPO and DXWorldEXPO have presented a rock star faculty as well as showcased hundreds of sponsors and exhibitors! In this blog post, we provide 7 tips on how, as part of our world-class faculty, you can deliver one of the most popular sessions at our events. But before reading...
DXWorldEXPO LLC announced today that "Miami Blockchain Event by FinTechEXPO" has announced that its Call for Papers is now open. The two-day event will present 20 top Blockchain experts. All speaking inquiries which covers the following information can be submitted by email to [email protected] Financial enterprises in New York City, London, Singapore, and other world financial capitals are embracing a new generation of smart, automated FinTech that eliminates many cumbersome, slow, and expe...
DXWordEXPO New York 2018, colocated with CloudEXPO New York 2018 will be held November 11-13, 2018, in New York City and will bring together Cloud Computing, FinTech and Blockchain, Digital Transformation, Big Data, Internet of Things, DevOps, AI, Machine Learning and WebRTC to one location.
As you move to the cloud, your network should be efficient, secure, and easy to manage. An enterprise adopting a hybrid or public cloud needs systems and tools that provide: Agility: ability to deliver applications and services faster, even in complex hybrid environments Easier manageability: enable reliable connectivity with complete oversight as the data center network evolves Greater efficiency: eliminate wasted effort while reducing errors and optimize asset utilization Security: implemen...
DXWorldEXPO | CloudEXPO are the world's most influential, independent events where Cloud Computing was coined and where technology buyers and vendors meet to experience and discuss the big picture of Digital Transformation and all of the strategies, tactics, and tools they need to realize their goals. Sponsors of DXWorldEXPO | CloudEXPO benefit from unmatched branding, profile building and lead generation opportunities.
@DevOpsSummit New York 2018, colocated with CloudEXPO | DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises - and delivering real results.
With tough new regulations coming to Europe on data privacy in May 2018, Calligo will explain why in reality the effect is global and transforms how you consider critical data. EU GDPR fundamentally rewrites the rules for cloud, Big Data and IoT. In his session at 21st Cloud Expo, Adam Ryan, Vice President and General Manager EMEA at Calligo, examined the regulations and provided insight on how it affects technology, challenges the established rules and will usher in new levels of diligence arou...
Dion Hinchcliffe is an internationally recognized digital expert, bestselling book author, frequent keynote speaker, analyst, futurist, and transformation expert based in Washington, DC. He is currently Chief Strategy Officer at the industry-leading digital strategy and online community solutions firm, 7Summits.
"We started a Master of Science in business analytics - that's the hot topic. We serve the business community around San Francisco so we educate the working professionals and this is where they all want to be," explained Judy Lee, Associate Professor and Department Chair at Golden Gate University, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
DXWorldEXPO LLC announced today that Dez Blanchfield joined the faculty of CloudEXPO's "10-Year Anniversary Event" which will take place on November 11-13, 2018 in New York City. Dez is a strategic leader in business and digital transformation with 25 years of experience in the IT and telecommunications industries developing strategies and implementing business initiatives. He has a breadth of expertise spanning technologies such as cloud computing, big data and analytics, cognitive computing, m...
Digital Transformation and Disruption, Amazon Style - What You Can Learn. Chris Kocher is a co-founder of Grey Heron, a management and strategic marketing consulting firm. He has 25+ years in both strategic and hands-on operating experience helping executives and investors build revenues and shareholder value. He has consulted with over 130 companies on innovating with new business models, product strategies and monetization. Chris has held management positions at HP and Symantec in addition to ...
Cloud-enabled transformation has evolved from cost saving measure to business innovation strategy -- one that combines the cloud with cognitive capabilities to drive market disruption. Learn how you can achieve the insight and agility you need to gain a competitive advantage. Industry-acclaimed CTO and cloud expert, Shankar Kalyana presents. Only the most exceptional IBMers are appointed with the rare distinction of IBM Fellow, the highest technical honor in the company. Shankar has also receive...
There is a huge demand for responsive, real-time mobile and web experiences, but current architectural patterns do not easily accommodate applications that respond to events in real time. Common solutions using message queues or HTTP long-polling quickly lead to resiliency, scalability and development velocity challenges. In his session at 21st Cloud Expo, Ryland Degnan, a Senior Software Engineer on the Netflix Edge Platform team, will discuss how by leveraging a reactive stream-based protocol,...
Enterprises have taken advantage of IoT to achieve important revenue and cost advantages. What is less apparent is how incumbent enterprises operating at scale have, following success with IoT, built analytic, operations management and software development capabilities - ranging from autonomous vehicles to manageable robotics installations. They have embraced these capabilities as if they were Silicon Valley startups.