Jon Hodgson’s Hidden in Plain Sight, Part 2: Obliterating Haystacks
See why only a big-data approach will help you find that needle in the haystack — by removing the haystack.
“When it comes time to find that elusive needle in the haystack, the easiest method is to simply remove the haystack,” says Jon Hodgson, global consulting engineer at Riverbed Technology. “Or, to relate that example to application performance management (APM), you need to remove the noise to reveal what matters.”
Hodgson is a Riverbed APM subject-matter expert with a background in systems administration, networking, programming, and application architecture. He’s helped hundreds of organizations optimize their mission-critical applications worldwide.
Hidden in plain sight
This article looks at three examples of problems hidden in plain sight, which is the subject of this series based on the experiences of Hodgson. In the series, we examine common application-performance problems that reveal themselves only when you look at them from the right vantage point, with tools that can capture all of the data, not just some of it. “A whole category of problems exhibit themselves as slow code and slow customer transactions,” says Hodgson. “But hidden from view are underlying infrastructure causes.”
Decoding hidden messages
To illustrate Hodgson’s point, Figure 1 shows a seemingly undecipherable jumble of letters. But when you use the right tool, in this case 3D glasses, a pattern emerges (Figure 2).
The haystack: Transaction noise
Hodgson shares a more technical example. Most teams start by analyzing the slowest transactions, and then determine that the root cause is a slow piece of code. Figure 3 shows more than 2,000 transactions over an eight-minute timeframe. Performance complaints have been pouring in, so the APM team zeros in on the transactions at 10:17 a.m. that take between seven and nine seconds.
But if the team fixes those slow transactions, will the end-user complaints stop? “Just because certain transactions are the slowest does not mean they are the culprits affecting users most,” says Hodgson. “That old logic rings true: Correlation does not imply causation.”
The fact is, the data set in Figure 3 is a mix of many transaction types, each with its own typical performance range. That blend of behaviors masks issues lurking just below the surface. To get at root-cause problems, you need to look deeper by removing the haystack to reveal the needles hidden in plain sight. And for that, you need a big-data approach that captures all transactions.
So how does this work in practice? Figure 4 shows the same transactions, but now you can distinguish between different transaction types. The blue transactions normally take about four seconds, but for about a minute some take about two times longer. These were the transactions the APM team focused on.
The faster red transactions, in contrast, normally take about 250ms or less. But every minute or so some take about 11 times longer — a much more severe problem than the original spike the APM team zeroed in on. Why? Hodgson says it’s “because 1) it’s a much larger change in behavior, 2) it affects more transactions, and 3) it occurs chronically.” In this case, the guilty party is an overloaded database — a completely different issue than the initial spike.
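The core idea here can be sketched in a few lines of Python. This is an illustrative example, not Riverbed's implementation: the transaction records and durations are made up to mirror the article's blue/red scenario, and the point is that an outlier is only meaningful relative to its own transaction type's baseline, not the mixed aggregate.

```python
from statistics import median

# Hypothetical transaction records: (type, duration in seconds).
# Mixed together, the slow "blue" outlier dominates; split by type,
# the "red" outlier is a far larger relative deviation.
transactions = [
    ("blue", 4.0), ("blue", 4.2), ("blue", 3.9), ("blue", 8.0),   # ~2x spike
    ("red", 0.25), ("red", 0.24), ("red", 0.26), ("red", 2.75),   # ~11x spike
]

def relative_outliers(txns, factor=1.5):
    """Flag transactions slower than `factor` times their own type's median."""
    by_type = {}
    for ttype, dur in txns:
        by_type.setdefault(ttype, []).append(dur)
    baselines = {t: median(ds) for t, ds in by_type.items()}
    return [
        (ttype, dur, dur / baselines[ttype])
        for ttype, dur in txns
        if dur > factor * baselines[ttype]
    ]

for ttype, dur, ratio in relative_outliers(transactions):
    print(f"{ttype}: {dur:.2f}s is about {ratio:.0f}x its baseline")
```

Judged against the combined data set, the 8-second blue transaction looks worst; judged per type, the red transaction's roughly 11x deviation is the more severe behavior change.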
“Fixing the initial spike would not have cured the performance hit from the overloaded database,” says Hodgson. “But solving the database problem would give you the biggest performance bang for your buck.”
“If you capture only a subset of your transactions, you will solve only a subset of your problems.”
– Jon Hodgson, global consulting engineer at Riverbed Technology
To slice and dice, you need all data
To remove the haystack, you need a toolset that lets you slice and dice your dataset to reveal patterns not visible in the aggregate. "If you capture only a subset of your transactions, you will solve only a subset of your problems," says Hodgson. It's therefore critical that your APM solution capture all transactions, all the time, rather than relying on an incomplete sampling approach.
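A quick thought experiment, with made-up numbers, shows why sampling undermines this kind of analysis. Suppose a chronic spike hits roughly 1 in every 60 transactions, as in the red-transaction example above; a 1% random sample will usually catch few or none of them, while full capture sees every one.

```python
import random

# Illustrative sketch: a 2750ms spike recurs once per 60 transactions
# against a 250ms baseline. These numbers are invented for the demo.
random.seed(42)
full = [2750 if i % 60 == 0 else 250 for i in range(6000)]

sampled = random.sample(range(len(full)), k=60)  # a 1% sample
spikes_full = sum(1 for ms in full if ms > 1000)
spikes_sampled = sum(1 for i in sampled if full[i] > 1000)
print(f"spikes in full capture: {spikes_full}, in 1% sample: {spikes_sampled}")
```

The full capture finds all 100 spikes; the sample finds about one on average, which is easy to dismiss as a fluke rather than recognize as a chronic pattern.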
The needle: A sawtooth pattern
Hodgson demonstrates this point with a real example from a financial organization that does banking, stock trading, portfolio management, and the like. Regulatory demands required the APM team to ensure that the company's applications could handle surges in traffic. In the past, trading microbursts had caused systems to lock up, which can have cascading effects on the stock market.
To see what the applications would do under severe stress, the team ramped up traffic on a production-parallel platform comprising hundreds of servers to 3X the peak load on the highest trading day. “Under this test you’d expect to see throughput increase proportionately, until it plateaus when some resource like CPUs or a database becomes saturated,” says Hodgson. (See the red line in Figure 5.) “Instead, the IT team saw thrashing behavior, which indicated something completely different.”
So the question the IT team posed to Hodgson was: What's causing the stalling, then the drop in throughput, then a surge, culminating with thrashing? "And by the way, this was a death spiral from which the environment could not recover, which is an even worse situation," says Hodgson.
To get to the root cause, they installed SteelCentral AppInternals, which was able to record all transactions, even at thousands of hits per second. Hodgson then compared the throughput chart from the load generator with the response-time chart from AppInternals, shown in Figure 6. “You’ll notice a distinct sawtooth pattern,” says Hodgson. “And remember from Part 1 that rigid patterns like that indicate something artificial is happening.”
Closer inspection also revealed that the sawtooth pattern preceded every burst of traffic. There’s a dip, a surge, a dip, a surge, and on and on. So the key was to figure out what caused the sawtooth pattern.
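Rigid periodicity like this can be surfaced mechanically. Here is a minimal sketch, not the AppInternals algorithm, using simple autocorrelation on a synthetic response-time series; the series and its 5-sample period are invented for illustration. A strong autocorrelation peak at some lag indicates an artificial, clock-like cycle rather than organic load variation.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of a series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var if var else 0.0

# Synthetic sawtooth: response time ramps up, then resets every 5 samples.
sawtooth = [100, 200, 300, 400, 500] * 8

# The lag with the strongest autocorrelation is the dominant period.
best_lag = max(range(1, 10), key=lambda lag: autocorrelation(sawtooth, lag))
print(f"dominant period: {best_lag} samples")
```

On this series the dominant lag comes back as exactly 5 samples, confirming the rigid cycle that a human eye spots as a sawtooth in the chart.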
Next, Hodgson looked at a handful of slow transactions and noticed that many experienced delays when calling a remote web service named GetQuotes.jws — a stock-ticker application managed by a different team in the firm.
Hodgson warns against assumptions: “When you fix the slowest few transactions you can’t assume you’ll also be fixing millions of other transactions. Just because the top-five slowest transactions are due to reason X, it doesn’t mean transactions six through 10,000 are also slow because of X,” he says.
But, since it's not practical to analyze 10,000 transactions individually, Hodgson recommends analyzing them as an aggregate collection. Then, if the overall root cause is in fact reason X, you'll prove that fixing it will fix all 10,000 transactions.
To test his theory, Hodgson then focused his analysis on only those transactions that made remote calls to the GetQuotes.jws web service. The sawtooth remained, but many other transactions were filtered out, further confirming his initial hypothesis. "In this step we focused only on the needles, those transactions that call GetQuotes.jws, by filtering out the other transactions that comprise the haystack," he says.
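The filtering step, and its inverse in the next paragraph, amounts to splitting the transaction population into two cohorts by whether they call the suspect service, then comparing each cohort's aggregate behavior. This sketch uses invented record shapes and timings; only GetQuotes.jws is taken from the article.

```python
# Hypothetical transaction records; the "calls" list names the remote
# services each transaction invokes. Service names other than
# GetQuotes.jws, and all durations, are made up for illustration.
transactions = [
    {"name": "Login",     "calls": ["AuthSvc"],                  "ms": 180},
    {"name": "Portfolio", "calls": ["GetQuotes.jws"],            "ms": 4200},
    {"name": "Trade",     "calls": ["GetQuotes.jws", "OrderDb"], "ms": 3900},
    {"name": "Report",    "calls": ["ReportDb"],                 "ms": 220},
]

def split_by_call(txns, service):
    """Partition transactions into (callers of service, everything else)."""
    callers = [t for t in txns if service in t["calls"]]
    others = [t for t in txns if service not in t["calls"]]
    return callers, others

def avg_ms(txns):
    return sum(t["ms"] for t in txns) / len(txns) if txns else 0.0

callers, others = split_by_call(transactions, "GetQuotes.jws")
print(f"calling GetQuotes.jws: {avg_ms(callers):.0f}ms avg")
print(f"not calling:           {avg_ms(others):.0f}ms avg")
```

If the slowness pattern survives in the calling cohort and vanishes from the non-calling cohort, the shared downstream service is implicated across every transaction type that touches it.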
Although this information was compelling on its own, Hodgson wanted to be absolutely certain before making recommendations that would ultimately lead to a lot of engineering work and potentially cause political turmoil. So as a final confirmation, he tested his theory’s inverse: Show only the transactions that do not call GetQuotes.jws:
Eureka: the sawtooth pattern and the thrashing behavior completely disappeared. That confirmed the theory beyond a shadow of a doubt: GetQuotes.jws, a shared downstream web service, was the culprit for millions of slow transactions.
“In this example, we removed the haystack to reveal all the needles, and then figured out what they all had in common to identify the singular thing that needed to be fixed,” says Hodgson. “We then used SteelCentral AppInternals to determine that this issue affected hundreds of different transaction types in dozens of seemingly unrelated applications, which gave the business clear justification to fix it.”
Engineers had previously spent months investigating the code and the servers, but they could never pinpoint the issue. “But in a couple of hours with SteelCentral AppInternals, we located the problem, quantified its effects, and armed the APM team with the evidence necessary to take action,” says Hodgson. At the end of the day, this is just another way SteelCentral AppInternals, and its big-data approach, can help reveal performance problems that are hidden in plain sight.
Next time we’ll explore a common but particularly vexing issue: Seemingly random intermittent slowness moving from one part of your application to another. Without the proper tools and methodologies, you might never identify this phantom’s root cause. In the next issue, we’ll discuss how to rapidly identify some of the most common causes and expel these specters from your applications for good.