|By Michael Kopp||
|December 12, 2012 07:30 AM EST||
Anyone who ever monitored or analyzed an application uses or has used averages. They are simple to understand and calculate. We tend to ignore just how wrong the picture is that averages paint of the world. To emphasis the point let me give you a real-world example outside of the performance space that I read recently in a newspaper.
The article was explaining that the average salary in a certain region in Europe was 1900 Euro's (to be clear this would be quite good in that region!). However when looking closer they found out that the majority, namely 9 out of 10 people, only earned around 1000 Euros and one would earn 10.000 (I over simplified this of course, but you get the idea). If you do the math you will see that the average of this is indeed 1900, but we can all agree that this does not represent the "average" salary as we would use the word in day to day live. So now let's apply this thinking to application performance.
The Average Response Time
The average response time is by far the most commonly used metric in application performance management. We assume that this represents a "normal" transaction, however this would only be true if the response time is always the same (all transaction run at equal speed) or the response time distribution is roughly bell curved.
A Bell curve represents the "normal" distribution of response times in which the average and the median are the same. It rarely ever occurs in real applications
In a Bell Curve the average (mean) and median are the same. In other words observed performance would represent the majority (half or more than half) of the transactions.
In reality most applications have few very heavy outliers; a statistician would say that the curve has a long tail. A long tail does not imply many slow transactions, but few that are magnitudes slower than the norm.
This is a typical Response Time Distribution with few but heavy outliers - it has a long tail. The average here is dragged to the right by the long tail.
We recognize that the average no longer represents the bulk of the transactions but can be a lot higher than the median.
You can now argue that this is not a problem as long as the average doesn't look better than the median. I would disagree, but let's look at another real-world scenario experienced by many of our customers:
This is another typical Response Time Distribution. Here we have quite a few very fast transactions that drag the average to the left of the actual median
In this case a considerable percentage of transactions are very, very fast (10-20 percent), while the bulk of transactions are several times slower. The median would still tell us the true story, but the average all of a sudden looks a lot faster than most of our transactions actually are. This is very typical in search engines or when caches are involved - some transactions are very fast, but the bulk are normal. Another reason for this scenario are failed transactions, more specifically transactions that failed fast. Many real-world applications have a failure rate of 1-10 percent (due to user errors or validation errors). These failed transactions are often magnitudes faster than the real ones and consequently distorted an average.
Of course performance analysts are not stupid and regularly try to compensate with higher frequency charts (compensating by looking at smaller aggregates visually) and by taking in minimum and maximum observed response times. However we can often only do this if we know the application very well, those unfamiliar with the application might easily misinterpret the charts. Because of the depth and type of knowledge required for this, it's difficult to communicate your analysis to other people - think how many arguments between IT teams have been caused by this. And that's before we even begin to think about communicating with business stakeholders!
A better metric by far are percentiles, because they allow us to understand the distribution. But before we look at percentiles, let's take a look a key feature in every production monitoring solution: Automatic Baselining and Alerting.
Automatic Baselining and Alerting
In real-world environments, performance gets attention when it is poor and has a negative impact on the business and users. But how can we identify performance issues quickly to prevent negative effects? We cannot alert on every slow transaction, since there are always some. In addition, most operations teams have to maintain a large number of applications and are not familiar with all of them, so manually setting thresholds can be inaccurate, quite painful and time-consuming.
The industry has come up with a solution called Automatic Baselining. Baselining calculates out the "normal" performance and only alerts us when an application slows down or produces more errors than usual. Most approaches rely on averages and standard deviations.
Without going into statistical details, this approach again assumes that the response times are distributed over a bell curve:
The Standard Deviation represents 33% of all transactions with the mean as the middle. 2xStandard Deviation represents 66% and thus the majority, everything outside could be considered an outlier. However most real world scenarios are not bell curved...
Typically, transactions that are outside two times standard deviation are treated as slow and captured for analysis. An alert is raised if the average moves significantly. In a bell curve this would account for the slowest 16.5 percent (and you can of course adjust that); however; if the response time distribution does not represent a bell curve, it becomes inaccurate. We either end up with a lot of false positives (transactions that are a lot slower than the average but when looking at the curve lie within the norm) or we miss a lot of problems (false negatives). In addition if the curve is not a bell curve, then the average can differ a lot from the median; applying a standard deviation to such an average can lead to quite a different result than you would expect. To work around this problem these algorithms have many tunable variables and a lot of "hacks" for specific use cases.
Why I Love Percentiles
A percentile tells me which part of the curve I am looking at and how many transactions are represented by that metric. To visualize this look at the following chart:
This chart shows the 50th and 90th percentile along with the average of the same transaction. It shows that the average is influenced far mor heavily by the 90th, thus by outliers and not by the bulk of the transactions
The green line represents the average. As you can see it is very volatile. The other two lines represent the 50th and 90th percentile. As we can see the 50th percentile (or median) is rather stable but has a couple of jumps. These jumps represent real performance degradation for the majority (50%) of the transactions. The 90th percentile (this is the start of the "tail") is a lot more volatile, which means that the outliers slowness depends on data or user behavior. What's important here is that the average is heavily influenced (dragged) by the 90th percentile, the tail, rather than the bulk of the transactions.
If the 50th percentile (median) of a response time is 500ms that means that 50% of my transactions are either as fast or faster than 500ms. If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower. The average in this case could either be lower than 500ms (on a heavy front curve), a lot higher (long tail) or somewhere in between. A percentile gives me a much better sense of my real world performance, because it shows me a slice of my response time curve.
For exactly that reason percentiles are perfect for automatic baselining. If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation. You need to react to that.
In many cases we see that the 75th or 90th percentile does not change at all in such a scenario. This means the slow transactions didn't get any slower, only the normal ones did. Depending on how long your tail is the average might not have moved at all in such a scenario.!
In other cases we see the 98th percentile degrading from 1s to 1.5 seconds while the 95th is stable at 900ms. This means that your application as a whole is stable, but a few outliers got worse, nothing to worry about immediately. Percentile-based alerts do not suffer from false positives, are a lot less volatile and don't miss any important performance degradations! Consequently a baselining approach that uses percentiles does not require a lot of tuning variables to work effectively.
The screenshot below shows the Median (50th Percentile) for a particular transaction jumping from about 50ms to about 500ms and triggering an alert as it is significantly above the calculated baseline (green line). The chart labeled "Slow Response Time" on the other hand shows the 90th percentile for the same transaction. These "outliers" also show an increase in response time but not significant enough to trigger an alert.
Here we see an automatic baselining dashboard with a violation at the 50th percentile. The violation is quite clear, at the same time the 90th percentile (right upper chart) does not violate. Because the outliers are so much slower than the bulk of the transaction an average would have been influenced by them and would not have have reacted quite as dramatically as the 50th percentile. We might have missed this clear violation!
How Can We Use Percentiles for Tuning?
Percentiles are also great for tuning, and giving your optimizations a particular goal. Let's say that something within my application is too slow in general and I need to make it faster. In this case I want to focus on bringing down the 90th percentile. This would ensure sure that the overall response time of the application goes down. In other cases I have unacceptably long outliers I want to focus on bringing down response time for transactions beyond the 98th or 99th percentile (only outliers). We see a lot of applications that have perfectly acceptable performance for the 90th percentile, with the 98th percentile being magnitudes worse.
In throughput oriented applications on the other hand I would want to make the majority of my transactions very fast, while accepting that an optimization makes a few outliers slower. I might therefore make sure that the 75th percentile goes down while trying to keep the 90th percentile stable or not getting a lot worse.
I could not make the same kind of observations with averages, minimum and maximum, but with percentiles they are very easy indeed.
Averages are ineffective because they are too simplistic and one-dimensional. Percentiles are a really great and easy way of understanding the real performance characteristics of your application. They also provide a great basis for automatic baselining, behavioral learning and optimizing your application with a proper focus. In short, percentiles are great!
|rtalexander 11/21/12 12:58:00 AM EST|
Hey, could you post a reference or two that covers the theory and/or practicalities of the approach you describe?
- Mainstream Business Applications and In-Memory Databases
- Working with Project Management Software – Who Is Managing Who?
- APM Convergence: Monitoring vs. Management
- Donald Fischer Joins General Catalyst as Venture Partner
- How to Performance Test Automation for GWT and SmartGWT
- Compuware APM Extends Leadership in Big Data
- Achieving Agile Transformation with Kanban, Kotter, and Lean Startup
- The Top Five Benefits of Cloud Computing
- Will These Five Websites Make the Same Mistake Twice During the Big Game?
- Compuware APM Recognized as Trendsetter in Big Data Solutions
- RSA Conference USA 2014 Exhibitor Profiles (A through L)
- Compuware APM Introduces Smart Network Packet Capture and Analysis
- Mainstream Business Applications and In-Memory Databases
- Consumer Electronics - Global Trends, Estimates and Forecasts, 2011-2018
- Working with Project Management Software – Who Is Managing Who?
- APM Convergence: Monitoring vs. Management
- Objective-C Programming: The Big Nerd Ranch Guide (2nd Edition)
- Small Medium Business (SMB) IT Continues to Gain Respect, What About SOHO?
- Donald Fischer Joins General Catalyst as Venture Partner
- Big Data Market: Business Case, Market Analysis and Forecasts 2014 - 2019
- 2014 International CES Exhibitor Profiles: Samsung Electronics America, Inc. to 3D Vision Technologies Limited
- Global Customer Relationship Management (CRM) Software Industry
- Creating JavaServer Faces Maven Managed Projects with Eclipse
- DataStax Hires Clint Smith as General Counsel
- Building a Drag-and-Drop Shopping Cart with AJAX
- What Is AJAX?
- Google Maps! AJAX-Style Web Development Using ASP.NET
- Where Are RIA Technologies Headed in 2008?
- How and Why AJAX, Not Java, Became the Favored Technology for Rich Internet Applications
- Flashback to January 2006: Exclusive SYS-CON.TV Interviews on "OpenAjax Alliance" Announcement
- "Real-World AJAX" One-Day Seminar Arrives in Silicon Valley
- AJAXWorld Conference & Expo to Take Place October 2-4, 2006, at the Santa Clara Convention Center, California
- AJAX Sponsor Webcasts Are Now Available at AJAXWorld Website
- AJAXWorld University Announces AJAX Developer Bootcamp
- AJAX Support In JadeLiquid WebRenderer v3.1
- Dolphin Announces Open API With Over 50 Add-ons Including Dropbox and Wikipedia
A myriad of new methodologies and technologies from cloud computing to BYOD have changed the IT landscape demonstrably. In parallel, higher-value functions around data, customer engagement and revenue growth are pushing CIOs to consider alternative approaches. The nexus is moving IT to a zero footprint infrastructure model. Organizations including Healthcare, Financial Services, Insurance and Airlines are moving in this direction. Unfortunately, traditional IT paradigms present a challenge to adopting these new methodologies.
Mar. 16, 2014 03:00 PM EDT Reads: 1,387
A trendsetter in intimate women’s apparel, Frederick’s of Hollywood, needed a cloud strategy that merged stability, security and PCI compliance needs with the ability to scale up during retail busy season. In her session at 14th Cloud Expo, Malorie Lanthier, VP of Information Technology at Frederick’s of Hollywood, will discuss how migrating to the cloud enhanced the scalability, resiliency and flexibility of Frederick’s of Hollywood’s IT infrastructure while elevating the ability to address company-wide business objectives. She’ll also share her cloud evaluation process, successful practices used to make the wholesale move to the cloud and advice for organizations looking to develop and execute a holistic cloud plan quickly and effectively.
Mar. 16, 2014 01:00 PM EDT Reads: 1,370
In the Subscription Economy, purchase-to-use is winning, and purchase-to-own is on its way out. The key to success is that customers are actually using a product or service and in the Subscription Economy, the future is clear: delivering value drives revenue. Accessing this data to understand usage, and resulting customer value, is mandatory to survival. In his session at 14th Cloud Expo, Matthew Shanahan, SVP of Marketing and Strategy at Scout by Service Source, will discuss how businesses in industries such as telecom, SaaS, financial services, healthcare and education can, and are, leveraging data science methods to predict and avoid customer churn and increase revenue.
Mar. 16, 2014 11:00 AM EDT Reads: 1,195
Cloud environments have created situations that allow users, customers, consumers, and employees to access Public, Intranet, and Extranet applications from different locations, devices, and as different personas. The focus of all the Internet and enterprise front-end applications today is to enhance the user experience. In addition, with advances in mobility and BYOD, the line between public and private becomes a deep shade of gray. At the same time, organizations are leveraging SaaS applications, such as Google Apps and DropBox, for their internal business communication and collaboration. This opens up challenges in providing a universal identity for the user, while at the same time retaining the flexibility to segregate access depending on the scenario. While cloud computing environments may offer different levels of abstraction to its users, federated identity management does not leverage these abstractions; each user must set up her identity management solution. This situation is...
Mar. 15, 2014 03:00 PM EDT Reads: 1,453
Netflix has long been known for being a leader and innovator in delivering its Netflix service by leveraging public cloud service provider, Amazon Web Services (AWS). However, Netflix’s IT team wasn’t quite there, and still had many legacy applications and infrastructure in place. One thing is to start a platform from the ground up solely leveraging cloud. It’s an entirely different process to do so from a legacy environment to more SaaS based and infrastructure as a service. The goal for Netflix is to achieve 100% of this by beginning of 2015. Mike Kail joined Netflix in 2011 as VP of IT Operations and is tasked with ensuring that this migration happens as smoothly as possible.
Mar. 14, 2014 10:28 AM EDT Reads: 1,114
SYS-CON Events announced today that Ambernet Technologies, the innovative “Cloud Management Center” company, will exhibit at SYS-CON's 14th International Cloud Expo®, which will take place on June 10–12, 2014, at the Javits Center in New York City, New York. Ambernet Technologies is a leading global provider of cloud management software (CloudTruOps) and IT professional services to the enterprise, service provider and government markets. CloudTruOps is the industry’s first infrastructure-independent and service-aware software solution that provides a fully transactional single pane of glass for cloud service provisioning & orchestration, governance, policy, security, performance, self-service storefront, and billing/chargeback for multiple clouds. Ambernet's IT professional services provide consulting services, solutions, and support. Ambernet is a global company with headquarters in Dallas, Texas and regional offices in Toronto, Canada, and Bangalore, India.
Mar. 14, 2014 09:00 AM EDT Reads: 1,440
The evolutionary nature of mobile presents a security-centric challenge for businesses with corporate content on these devices. Enterprises put themselves at risk when users access sensitive information through email and applications across smartphones and tablets, while mobile. Organizations can choose to ignore this security threat or enhance employee productivity through secure corporate containers. In his session at 14th Cloud Expo, Eric Owings, an enterprise account executive at AirWatch®, will discuss best practices and strategies to ensure global security and workforce enablement by leveraging enterprise mobility management (EMM) across the enterprise. He will also provide attendees with a deeper understanding of enterprise mobility in a connected ecosystem, while ensuring security and compliance in the cloud.
Mar. 7, 2014 09:45 AM EST Reads: 1,888
Cascading is the popular Java-based application development framework for building Big Data applications on Apache Hadoop. This open source framework allows you to leverage existing skillsets such as Java, SQL, R, and more to create enterprise-grade applications without having to think in MapReduce. In his session at 5th Big Data Expo, Alexis Roos, a Senior Solutions Architect focusing on Big Data solutions at Concurrent, Inc., will give an introduction to Cascading, how it works, and then dive into how enterprises can start building applications with Cascading. Come and see how companies like Twitter, eBay, Etsy, and other data-driven companies are taking advantage of Cascading and how Cascading is changing the business of Big Data in the enterprise.
Mar. 4, 2014 11:15 AM EST Reads: 1,966
The world’s largest and most successful private cloud operations are revolutionizing their approach to demand management. These organizations have recognized that while self-service portals are a component in the overall cloud architecture, these tools do not enable demand management. In fact, in many cases the portals and end-user interfaces don’t actually capture anything to do with demand, but instead force the user to enter the capacity “supply” requirements that they think will meet their demands. This is very different. Large enterprises have recognized the need to look beyond immediate requests to also model the “pipeline” of new demands that will be coming down the road. It is only by capturing new immediate requirements, an understanding of the pipeline and what is running in environments that organizations can possibly hope to accurately model demand and properly allocate compute, storage and network resources.
Mar. 4, 2014 10:15 AM EST Reads: 1,998
Almost everyone sees the potential of Internet of Things but how can businesses truly unlock that potential. The key will be in the ability to discover business insight in the midst of an ocean of Big Data generated from billions of embedded devices via Systems of Discover. Businesses will also need to ensure that they can sustain that insight by leveraging the cloud for global reach, scale and elasticity. Without bringing these three elements together via Systems of Discover you either end up with an Internet of somethings and/or a big mess of data. In his session at @ThingsExpo, Mac Devine, a Distinguished Engineer at IBM, will focus on how to ensure businesses have the right plans in place for Systems of Discovery for the Internet-of-Things world we are entering.
Mar. 4, 2014 09:00 AM EST Reads: 2,546
Nominations for participating vendors will be accepted through Twitter at @ThingsExpo. The "Open Cloud Shoot-Out at @ThingsExpo New York," in which leading cloud providers are expected to participate, will be held live on stage at the event. The Shootout will provide the vendors with an opportunity to demonstrate the features and capabilities of their products, with a particular focus on interoperability, scalability, security, and reliability in terms of development, deployment, and management.
Feb. 25, 2014 02:30 PM EST Reads: 2,526
As businesses aspire to move more and more application workloads outside of the boundaries of their private cloud data centers, public cloud service providers are increasingly implementing a private cloud staple: resiliency. In his session at 14th Cloud Expo, John Roese, SVP and Chief CTO at EMC Corporation, will summarize the key architectural tenets of resilient private cloud architectures. These tenets can be implemented in any service provider cloud implementation, regardless of hypervisor choice (e.g., VMware, Hyper-V, Xen), cloud orchestration software (e.g., vSphere, OpenStack), network implementation (e.g., SDN, NFV), or storage implementation (file, block, object). A resilient public cloud will naturally attract increased workload migration, and the rest of the session will describe foundational technologies that facilitate not only secure and seamless application workload migration, but secure and seamless data set migration as well.
Feb. 25, 2014 11:00 AM EST Reads: 2,136
Fueled by the global economic situation, the government's focus on datacenter consolidation and the "Cloud First" initiative, Cloud Computing continues to be the buzzword of the year. As government agencies start to adopt cloud computing, additional challenges including security in the cloud have become prominent barriers to adoption. In his session at 14th Cloud Expo, Majed Saadi, Director of the Cloud Computing Practice at SRA International, will focus on providing a quick Cloud Computing technology update with an emphasis on current Cloud Computing security trends and drivers. Examples of these trends include: the utilization and evaluation of Clouds in both active and passive surveillance systems and the use of High Performance Clouds for expanding scientist ability to access data. He will also introduces best practices and lessons learned for securing both public and private cloud environments. It offers insight into how Cloud Computing coupled with other technical advancements i...
Feb. 24, 2014 09:45 AM EST Reads: 2,568
With Windows Server 2003 end of extended support approaching, enterprises must begin their migration planning for all affected production applications. There are a variety of approaches and many people will take a “mix and match” approach. Whatever the approach, it’s important to have a migration plan now – 200 business days goes by quickly when some applications take weeks to migrate. This is the perfect opportunity to move those applications to the Cloud. There’s a way to move your applications and modernize (move to the cloud) at the same time.
Feb. 23, 2014 11:30 AM EST Reads: 1,905
Software development, like engineering, is a craft that requires the application of creative approaches to solve problems given a wide range of constraints. However, while engineering design may be craftwork, the production of most designed objects relies on a standardized and automated manufacturing process. By contrast, much of what's typically involved when moving an application from prototype to production and, indeed, maintaining the application through its lifecycle remains craftwork.
Feb. 22, 2014 01:30 PM EST Reads: 1,983