Deep Insight and Collaboration in the Cloud: A Customer Story

The true power of a continuous, holistic APM approach in a cloud-based environment

Recently, one of our customers, let's call them PointInFact, had a very typical problem. After they deployed a new version of their software, some user requests degraded horribly: requests that should have taken half a second took up to a minute. The PointInFact team runs a multi-tenant SaaS solution in the AWS Cloud and relies heavily on cloud services. This reliance makes User Experience Management and fault domain isolation very challenging.

Back Story: Application Running in the AWS Cloud
PointInFact runs a SaaS service. Internally this results in a multi-tenant service where each customer has its own instance of the application it subscribes to. All of these applications and services are hosted in Amazon's EC2 Cloud, where the team dynamically creates new application environments and offloads some functionality to AWS by using the provided services. For a SaaS business, customer satisfaction is very important. As a consequence, they monitor all applications and services centrally, from both an end-user and a server-side perspective, with Compuware APM.

Performance Degradation
After one deployment, the APM solution informed the operations team that user experience was degrading. A look at the geographical distribution showed them that this was not a localized phenomenon, but worldwide.

Worldwide distribution of End User Experience

Notice all the red circles in the above screenshot. Each red circle indicates frustrated users. One particularly interesting fact in this dashboard is that the average web page response time (upper right corner) remains stable and well below the one-second mark. This means that the system as a whole is still running fine and is not in general meltdown mode. However, it also shows why it is not good to rely on averages for actual monitoring and why server-side response times alone are not enough. Your end users are at the edge, around the world, and not sitting next to one of Amazon's data centers.
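
To make the averaging effect concrete, here is a minimal sketch in Python. The sample data, the function name `summarize`, and the 4-second "slow" threshold are all assumptions chosen for illustration; they are not taken from PointInFact's environment or from the APM product. It shows how a small share of very slow requests can leave the mean below one second while a high percentile and a simple count of slow requests expose the problem.

```python
import statistics

def summarize(response_times_s, slow_threshold_s=4.0):
    """Summarize response times (seconds) with the mean, median, p99 and a slow-request count.

    slow_threshold_s is an assumed cut-off for a "frustrating" request.
    """
    ordered = sorted(response_times_s)
    n = len(ordered)
    # Nearest-rank 99th percentile; (n * 99 + 99) // 100 is ceil(0.99 * n) in integer arithmetic.
    rank = (n * 99 + 99) // 100
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "slow_count": sum(1 for t in ordered if t > slow_threshold_s),
    }

# Hypothetical sample: 980 fast requests (0.3 s) and 20 badly degraded ones (30 s).
samples = [0.3] * 980 + [30.0] * 20
summary = summarize(samples)
print(summary)
```

With this made-up sample the mean comes out at roughly 0.894 seconds and the median at 0.3 seconds, so a dashboard showing only those two numbers looks healthy, while the 99th percentile is 30 seconds and 20 requests exceed the threshold.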

The next thing the operations team did was look at the application flow. They were hoping for something big to show up immediately, but nothing out of the ordinary stood out.

Complete Application Flow

This is not really surprising; they were looking at an application flow overview of about half a million transactions - the averaging effect in full action.

The interesting takeaway, however, was that although user experience suffered across the board, it could not be attributed to a general meltdown of the environment. It was time to look at specific transaction types and their baselines.
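
The idea of comparing each transaction type against its own baseline can be sketched as follows. This is only an illustration under stated assumptions: the transaction names, the baseline values, the `tolerance` factor and the function `find_baseline_violations` are hypothetical, and the baselining that the APM solution actually performs is not described in this story.

```python
from collections import defaultdict
from statistics import median

def find_baseline_violations(samples, baselines, tolerance=2.0):
    """Return the transaction types whose observed median exceeds tolerance * baseline.

    samples:   iterable of (transaction_type, response_time_seconds) pairs
    baselines: dict mapping transaction_type to its expected median in seconds
    """
    by_type = defaultdict(list)
    for tx_type, seconds in samples:
        by_type[tx_type].append(seconds)

    violations = {}
    for tx_type, times in by_type.items():
        baseline = baselines.get(tx_type)
        if baseline is None:
            continue  # no baseline recorded for this type, so nothing to compare against
        observed = median(times)
        if observed > tolerance * baseline:
            violations[tx_type] = observed
    return violations

# Hypothetical traffic mix: only one transaction type is badly degraded.
samples = (
    [("login", 0.2)] * 500
    + [("search", 0.4)] * 480
    + [("document_request", 45.0)] * 20
)
baselines = {"login": 0.25, "search": 0.5, "document_request": 0.5}
violations = find_baseline_violations(samples, baselines)
print(violations)
```

In this invented data set the overall picture is dominated by fast logins and searches, yet grouping by transaction type singles out `document_request` as the only type far above its baseline.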

High-level Performance dashboard that shows a response time violation

In the dashboard above, the marked and highlighted upper right chart shows that one particular service call in the application was off the charts. The dashboard also shows that at the same time the CPU (lower left corner) of one of their servers was exhausted. Were the two events related even though they occurred on different hosts? A detailed look at the offending request type revealed something very interesting.

Detailed Performance Analysis of the offending request showing that most of the CPU is spent in XML and XSLT processing

The highlighted chart in the middle shows the CPU distribution of the offending service calls. CPU was spiking and the root cause could be attributed to XML processing and subsequent XSL transformations. This is indicated by the yellow and blue bars that represent XML and XSL processing, respectively. This was the reason for the CPU exhaustion noticed earlier.

Having determined the likely cause of the slowdown, the PointInFact team took a step back and asked which users and documents were impacted by it.

This shows which End Users are impacted by the performance issue - focus on the last page load taking ~360s

This was very important for two reasons. First, it allowed them to be proactive with their own users who experienced slowdowns. Second, it further isolated the real problem area.

The Performance Bottleneck That Should Not Be
Now that the trigger for the slowdown was revealed, the performance team looked into the root cause. When looking at the Transaction Flow for the impacted business transactions, two things stood out.

The Application Flow for the offending requests shows that most time is spent in the service in the lower-right corner

One can see that most of the response time is spent in the document request service (lower-right corner). In addition, they knew from the previous dashboard that the application tier consumed a lot of CPU in the XML/XSLT processing. The conclusion for the performance team was clear: caching did not work.

To understand this, we need to know that the document requests and subsequent transformations should only happen once per document. After that, all follow-up requests should take the result from the cache. PointInFact is leveraging the memcached-compliant AWS ElastiCache for this purpose. What the analysis revealed was that the same document was transformed many times, hence caching was not working!
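
The intended behavior described here is commonly called the cache-aside (or lazy loading) pattern. The sketch below shows it in Python under clearly stated assumptions: `InMemoryCache` is a local stand-in with memcached-style `get`/`set` methods rather than a real ElastiCache client, and `transform_document` is a hypothetical placeholder for the expensive XML/XSLT processing. It is not PointInFact's code.

```python
class InMemoryCache:
    """Local stand-in for a memcached-style cache (assumption, not a real client)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # None signals a cache miss

    def set(self, key, value):
        self._data[key] = value


def transform_document(doc_id):
    """Hypothetical placeholder for the expensive XML/XSLT transformation."""
    transform_document.calls += 1
    return f"<html>rendered {doc_id}</html>"

transform_document.calls = 0  # counts how often the expensive path runs


def get_rendered_document(cache, doc_id):
    """Cache-aside: look up first, transform on a miss, then store the result."""
    key = f"rendered:{doc_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = transform_document(doc_id)
    cache.set(key, result)  # without this step every request repeats the transformation
    return result


cache = InMemoryCache()
first = get_rendered_document(cache, "doc-42")
second = get_rendered_document(cache, "doc-42")
print(transform_document.calls)
```

In this sketch the second request for the same document is served from the cache, so the transformation runs only once for the two calls.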

The obvious conclusion was that there was a problem with ElastiCache. As this was a third-party component, the customer needed more information before approaching Amazon with support requests. Thanks to their APM strategy they had sufficient insight into their usage of ElastiCache in production. This turned out to be very good, because otherwise opening a support ticket for ElastiCache would not only have been time-consuming, it would also have been futile, as we shall see.

Do or Do Not Cache, There Is No Try...
In an attempt to get more information about the caching problem, the customer identified the real root cause. While each of the offending document requests was doing a cache lookup upfront, none of them put the result in the cache afterwards. There was no problem with the cache; it simply was not being used.
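
One way to make this kind of failure visible is to count cache operations and check whether misses are ever followed by writes. The following Python sketch is an assumption-laden illustration: `CountingCache`, `DictBackend` and `buggy_request` are hypothetical names, the backend is a plain dictionary rather than a real memcached client, and the story does not say how the APM tooling surfaced this information.

```python
class DictBackend:
    """Plain in-memory backend with get/set (stand-in for a real cache client)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class CountingCache:
    """Wraps any object with get(key) and set(key, value) and counts its usage."""

    def __init__(self, backend):
        self._backend = backend
        self.lookups = 0
        self.hits = 0
        self.writes = 0

    def get(self, key):
        self.lookups += 1
        value = self._backend.get(key)
        if value is not None:
            self.hits += 1
        return value

    def set(self, key, value):
        self.writes += 1
        self._backend.set(key, value)

    def looks_unused(self):
        """True if callers look things up but never store anything."""
        return self.lookups > 0 and self.writes == 0


def buggy_request(cache, doc_id):
    """Mimics the faulty path: a lookup up front, but no write after the work is done."""
    cached = cache.get(doc_id)
    if cached is not None:
        return cached
    return f"rendered {doc_id}"  # result is returned but never put into the cache


cache = CountingCache(DictBackend())
for _ in range(5):
    buggy_request(cache, "doc-42")
print(cache.lookups, cache.hits, cache.writes, cache.looks_unused())
```

Five lookups, zero hits and zero writes is the signature described above: the cache itself is healthy, it is simply never populated.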

At this point, the development team started looking into it at the code level and could identify a problem with the cache client library that they were using. That cache client was also a third-party component, but now they had something tangible to share with the maintainers of the cache client. Long story short, the problem was fixed upstream and one deployment later the problem was solved to everybody's satisfaction.

Conclusion
To me, this story shows the true power of a continuous, holistic APM approach in a cloud-based environment.

  • The customer was able to identify a problem in production that had a big end-user impact, although the average transaction was still considered fast.
  • The operations team could identify exactly which users were impacted and be proactive in their customer support.
  • More importantly, the R&D team was able to identify the real root cause in one third-party component while avoiding a lengthy and futile back-and-forth with another third-party vendor (Amazon ElastiCache).

Finally, PointInFact was able to track down the root cause with sufficient depth to provide the responsible third party with a fix proposal, giving them a faster turnaround time on a permanent solution. And all of this in a public, globally distributed, multi-tenant cloud application.

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes in application performance management in large-scale production environments, with a special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.
