Jon Hodgson’s Hidden in Plain Sight, Part 1: The Flaw of Averages

For a quick way to reveal application-performance problems, examine your CPU usage. If you see usage sitting at an all-too-perfect percentage, such as 25% or 50%, you likely have a problem, says Jon Hodgson, global consulting engineer at Riverbed Technology. Hodgson is a Riverbed APM subject-matter expert with a background in systems administration, networking, programming, and application architecture. He’s helped hundreds of organizations worldwide optimize their mission-critical applications.

“In the chaotic world that is computing, nothing runs naturally at exactly 25%,” he says. “Something artificial must be happening there — and that’s a red flag.”

Simple math explains why: Let’s say your server has four CPU cores. Something behind the scenes goes haywire. One core pegs at 100%, while the other three carry only light, varying loads. In your monitoring tool, you’ll see overall utilization fluctuate, but it will never dip below a too-perfect floor of 25%, because the pegged core alone accounts for a quarter of the server’s total capacity. Find the cause of the spiking utilization, and you’ll likely find a cause of slow applications.
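To see how that floor arises, here is a minimal Python sketch (purely illustrative; the core count and idle-load range are invented for this example) that simulates a four-core server with one core pegged at 100% and reports the overall utilization a monitoring tool would show:

```python
import random

CORES = 4

def overall_utilization():
    """One sample: core 0 is pegged at 100%; the others idle at low, random loads."""
    per_core = [100.0] + [random.uniform(0.0, 15.0) for _ in range(CORES - 1)]
    return sum(per_core) / CORES

samples = [overall_utilization() for _ in range(60)]
print(f"min sample: {min(samples):.1f}%")  # never drops below 100 / 4 = 25.0%
print(f"max sample: {max(samples):.1f}%")  # wobbles, but always above the floor
```

However much the idle cores vary, the aggregate can never fall below 25%: exactly the suspiciously perfect floor Hodgson describes.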

Hidden in plain sight

It’s a nice example of a problem that’s hidden in plain sight, which is the subject of this series, drawn from Hodgson’s experiences. We’ll examine common application-performance problems that only reveal themselves when you look at them from the right vantage point, with tools that can capture all of the data, not just some of it. “A whole category of problems exhibit themselves as slow code and slow customer transactions,” says Hodgson. “But hidden from view are underlying infrastructure causes—like the CPU example.”

The flaw of averages claims another victim

The flaw of averages, the subject of this article, can hide problems in plain sight. Professor Sam Savage first explained the concept in his October 8, 2000, article in the San Jose Mercury News. In that article he says: “The Flaw of Averages states that: Plans based on the assumption that average conditions will occur are usually wrong. A humorous example involves the statistician who drowned while fording a river that was, on average, only three feet deep.”

Figure 1. "A humorous example involves the statistician who drowned while fording a river that was, on average, only three feet deep." Source: The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty by Sam L. Savage, with illustrations by Jeff Danziger.

For application-performance troubleshooting, why shouldn’t you rely on averages? According to Hodgson, it’s because “there is significant variety in what you’re monitoring.”

The fact is, the vast majority of transactions for most applications are probably fine. When you roll the poor-performing outliers into statistics built from that huge number of healthy transactions, everything still looks good: the outliers get averaged out and hidden from sight.
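A toy calculation shows how completely a healthy-looking average can swallow those outliers. The numbers below are invented for illustration: 9,990 fast transactions and 10 disastrous ones, summarized three different ways.

```python
import statistics

# Invented numbers: 9,990 transactions around 200 ms, plus 10 that take 12 seconds.
latencies_ms = [200] * 9990 + [12_000] * 10

mean = statistics.mean(latencies_ms)
p99 = sorted(latencies_ms)[int(len(latencies_ms) * 0.99)]
worst = max(latencies_ms)

print(f"mean  : {mean:.1f} ms")  # ~211.8 ms -- looks healthy
print(f"p99   : {p99} ms")       # 200 ms   -- still looks healthy
print(f"worst : {worst} ms")     # 12000 ms -- ten users waited 12 seconds
```

The mean, and even the 99th percentile, look perfectly acceptable; only the full distribution of every transaction for every user exposes the ten people who had a terrible experience.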

Tools that collect only a sampling of data compound the flaw of averages problem, says Hodgson. “Because Riverbed takes a big data approach — we collect all the data with no sampling — we can look at the actual distribution for every transaction for every user.”

The flaw of averages can hide problems behind data that looks just fine from one angle, but is clearly not OK when looked at with the right tool.
– Jon Hodgson, global consulting engineer at Riverbed Technology

To chase ghosts you need a ghost detector

Hodgson has a favorite way to illustrate this, again involving CPUs. Let’s say you’re measuring CPU load with SteelCentral™ AppInternals, which collects data every second. As Figure 2 shows, usage intermittently spikes to 100% and runs heavily loaded at other times. Those spikes mark where slowness will occur.

Figure 2

It’s a different story when you look at the exact same data using a tool with only 15-second granularity, as seen in Figure 3. No problem here, right? “In both scenarios, the problem exists, but at 15-second granularity you’re not aware of it,” says Hodgson. “The information averages out and simply slips through the cracks.”
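The effect is easy to reproduce. This sketch (with invented load numbers) builds one minute of 1-second CPU samples containing brief spikes to 100%, then averages them into 15-second buckets the way a coarser tool would report them; in the averaged view the spikes disappear.

```python
# One minute of 1-second CPU samples: a ~30% baseline with three brief 100% spikes.
one_second = [30.0] * 60
for spike_at in (12, 31, 47):
    one_second[spike_at] = 100.0

# What a 15-second tool reports: the mean of each 15-sample bucket.
fifteen_second = [
    sum(one_second[i:i + 15]) / 15
    for i in range(0, len(one_second), 15)
]

print(max(one_second))   # 100.0 -- the 1-second view shows the problem
print(fifteen_second)    # roughly [34.7, 30.0, 34.7, 34.7] -- no spike in sight
```

The same minute of data looks alarming at 1-second granularity and perfectly calm at 15-second granularity, which is exactly how the problem stays hidden in plain sight.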

Figure 3

Forgotten freeware claims 10,000 CPUs

One day Hodgson visited a customer and immediately spotted a 16-core machine whose CPU usage never dipped below a 6% floor. The company’s IT team thought the machine was fine and kept looking elsewhere for performance problems.

But, as you now know, that 6% floor should have been a red-flag example of the too-perfect number discussed above. (Do the math: 100% usage for the entire server, divided by 16 cores, is 6.25% per core.) Using SteelCentral AppInternals, Hodgson quickly discovered that a little freeware system-administration utility, not updated since 2004, was by itself devouring one entire CPU of the 16 available.

Worse, Hodgson then discovered that the offending freeware utility was part of the default build for more than 10,000 company servers. Every one of those servers had one of its cores locked up by the utility, affecting thousands of applications and countless end-user transactions.

“And no one knew about it because it was hidden in plain sight, wasting resources and killing performance,” says Hodgson. “But by looking at it with the right tool, we recovered processing time equivalent to 10,000 CPUs, and a lot of unexplained problems immediately disappeared — at little cost or effort.”

A final thought: Fixing problems reveals other problems

I asked Hodgson whether fixing overarching performance problems tends to reveal other problems. He said “yes,” in two important ways. First, users change their behavior, using the newly responsive service more frequently or in different ways. That puts new strain on resources because “something painful is no longer painful.”

And second, troubleshooters will notice other problem areas that were previously hidden from view by the first problem’s effects. To illustrate, Hodgson says to picture an assembly line in which Joe takes 40 seconds to do his step. When you fix Joe’s performance, you suddenly realize that Bob, the next person in the chain, wasn’t performing either; his slowness was previously masked by Joe’s. This is often referred to as “the problem moving downstream.”

Chalk that up to another example of problems hidden in plain sight. For another look at troubleshooting application performance, see 3 Guidelines for Troubleshooting Performance Issues. Next time we’ll look at the best way to find the needle in the haystack for production-performance problems by removing the haystack.

