4 IT Ops metrics for better root-cause analysis

It happened years ago, but I can still remember it as if it were yesterday. Erroneous data began to appear in a customer service application. Employees and customers became frustrated, and the problem cost the company more than $5,000 an hour. One customer received a bill for $8 million.

As we searched for the cause, we suspected that the database was failing. However, all subsystems checked out when we ran diagnostics. We did find I/O errors in the database log file, and that led us to a faulty storage system where some of the data resided. Problem found? Not yet.

We replaced the storage system in the data center for a mere $100,000. The problem seemed to go away—for a while. But then erroneous data appeared again, this time in another part of the application.

The same symptoms appeared, including errors in the log file, which led to the storage system that we had just replaced. The storage provider came out and determined that the power supply for the storage system was producing "dirty power." This meant spikes that quickly fried the disk controller. Thus the I/O errors, thus the errors in the log file, and thus the corrupted data that was externalized by the application. Problem found.

In the example above, the root cause was a faulty power supply, and not the storage system or the database. However, we had to go through several weeks of replacing things that were not broken to reach that determination. All in all, the enterprise lost over a half-million dollars over those weeks, and employees and customers lost faith in the IT department.

Enter the notion of root-cause analysis, or RCA, as a core operations discipline. RCA is really just a method of problem solving that focuses on identifying the root cause of faults or other issues.


Metrics that matter

What are the core factors considered in RCA? Something is considered a root cause if its removal from a problem-fault sequence means the final undesirable event does not recur. A causal factor is something that affects the event's outcome but is not the root cause, such as the database from the above example. If we had loaded a new version of the database, for example, it would not have fixed the root cause of the problem.

RCA is one of those topics that has attracted a lot of study because it reaches beyond IT. You can use RCA to identify mechanical issues, electrical issues, even people issues. But not much has been written about the metrics that IT Ops should employ when performing RCA.

Here are four metrics I consider to be core.

1. The component that presents the symptom is almost never the root cause  

The component that shows the symptom is the root cause in only about 5% of IT RCA cases. The other 95% of the time, you'll find that the root cause is a component further down the problem sequence.

In the example above, the application producing erroneous data was naturally the primary suspect. However, as in most cases, the cause was something else.

The key lesson learned in this example is that you should consider each step in the sequence, including application, database, I/O, storage, and power, as potential root causes, right out of the gate. To speed recovery, each component should be isolated and diagnosed in parallel.
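One way to picture diagnosing the whole sequence at once is the minimal sketch below. It assumes a hypothetical diagnose() function standing in for whatever health check each component really offers (log scans, vendor diagnostics, and so on), and its results are hard-coded to mirror the story above rather than taken from any real system.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

# Components in the problem sequence from the example, symptom first.
CHAIN = ["application", "database", "io", "storage", "power"]


def diagnose(component: str) -> tuple[str, bool]:
    """Hypothetical health check; returns (component, healthy).

    A real check would query logs, counters, or a vendor diagnostic tool.
    The faults here are hard-coded to mirror the story in this article.
    """
    simulated_faults = {"io", "storage", "power"}
    return component, component not in simulated_faults


def diagnose_all(components: list[str]) -> dict[str, bool]:
    """Run every check concurrently instead of one suspect at a time."""
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        return dict(pool.map(diagnose, components))


results = diagnose_all(CHAIN)
suspects = [name for name in CHAIN if not results[name]]
print(suspects)
```

Because every component is checked in the same pass, the power supply shows up as a suspect on day one instead of after the second storage failure.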

Even if all systems check out fine, you need to look at the complete chain, and that includes looking at the data moving to and from the application in what's called a "data trace." In the end, the fault could lie in how the components interact, in which case the cause is the system as a whole rather than any single component.

2. The less obvious is typically the root cause  

In the example above, the database and the storage system were the obvious suspects. However, the culprit ended up being a faulty power supply, which was never even considered until the second storage system failed. In 80% of failures, RCA determines that the source of the failure was never included in the initial analysis.

An RCA pattern is beginning to emerge around the use of public, private, and hybrid clouds: The root cause of issues that arise at the application level, the infrastructure level, or the database level is typically traced to an entirely different level. The root cause of most problems encountered with cloud-based workloads is poorly configured cloud machine instances. This includes allocating too little compute or storage for the workloads to scale as required.

But in almost all cases, the workloads themselves are blamed, or it's said to be "some bug in the cloud." Those doing RCA should start with the least obvious and move to the obvious. If you do that, chances are you'll find the root cause more quickly.

3. Collect transaction performance metrics in real time at the "speed of need"  

IT Ops organizations that follow this pattern typically find root causes about 90% faster than IT Ops organizations that don't.

The upside is easy to understand: You gather data from components that may be the root cause of the problem. As the components change their behavior over time, for example by logging more I/O errors or corrupt database records, you're able to take proactive measures.

For example, if I had been monitoring the failing power supply, I would have seen that the power was fluctuating beyond acceptable limits. The supply could then have been replaced before it damaged the storage system.
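As a rough illustration of that kind of check, here is a minimal sketch. The nominal voltage, the 5% tolerance, the violation limit, and the sample readings are all assumptions made up for the example; they are not figures from the incident.

```python
from __future__ import annotations

NOMINAL_VOLTS = 120.0  # assumed nominal supply voltage
TOLERANCE = 0.05       # assumed acceptable deviation (5%)


def out_of_tolerance(reading: float) -> bool:
    """True when a reading deviates from nominal by more than TOLERANCE."""
    return abs(reading - NOMINAL_VOLTS) / NOMINAL_VOLTS > TOLERANCE


def should_replace(readings: list[float], max_violations: int = 2) -> bool:
    """Flag the supply once more than max_violations readings are out of range."""
    violations = sum(out_of_tolerance(r) for r in readings)
    return violations > max_violations


# Illustrative samples, not real measurements.
samples = [119.8, 120.3, 131.0, 118.9, 107.5, 133.2, 120.1]
print(should_replace(samples))
```

Counting violations over a window, rather than reacting to a single spike, is one simple way to separate a noisy reading from a supply that is genuinely drifting.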

This kind of monitoring is expensive. But in most cases, it pays for itself within six months, given the trouble it helps avoid. Moreover, it provides the ability to do RCA in a proactive rather than reactive way, using real-time information to determine component-level health, as well as holistic health around all system components that could potentially fail.

There are two levels to this approach, micro and macro. At the micro level, you look at the core components, watching for externalized data that may indicate a forthcoming failure. 

At the macro level, you consider the data from each component in the context of all the others. For example, you know that I/O errors lead to corrupted data, which leads to erroneous information. Monitoring the system holistically, rather than one component at a time, allows you to see down the problem sequence and get to the root cause quickly.
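The macro view can be sketched as a walk down the dependency chain from the component showing the symptom. The DEPENDS_ON map and the health flags below are hypothetical and simply mirror the example from the start of this article; a real environment would build them from its own monitoring data.

```python
from __future__ import annotations

# Each component maps to the one it depends on (None marks the end of the chain).
DEPENDS_ON = {
    "application": "database",
    "database": "io",
    "io": "storage",
    "storage": "power",
    "power": None,
}


def deepest_unhealthy(symptom: str, healthy: dict[str, bool]) -> str | None:
    """Follow dependencies from the symptom; return the last unhealthy component."""
    candidate = None
    current = symptom
    while current is not None:
        if not healthy[current]:
            candidate = current
        # Keep walking even past healthy components: the fault may sit deeper.
        current = DEPENDS_ON[current]
    return candidate


# Health flags mirroring the story: the database itself checked out fine.
health = {
    "application": False,
    "database": True,
    "io": False,
    "storage": False,
    "power": False,
}
print(deepest_unhealthy("application", health))
```

Note that the walk does not stop at the database just because it passed its diagnostics, which is exactly the trap we fell into.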

4. Take automated corrective action, which relieves humans from having to respond

While most of what's presented above assumes that humans are taking corrective action, the optimal RCA solution factors out the humans.

The benefits of this approach are easy to define. Systems run 24/7; humans do not. Instead of waking up some admin to correct a problem, an automated process can do RCA, and even self-correct.

In the example I used at the beginning of this article, one automated, person-less approach would have been to hook up the power supply to a monitoring device. If the power drifted a certain percentage beyond the threshold, the device would automatically switch to a backup supply and alert IT Ops to change out the faulty power supply ASAP. This would have led to no downtime and a savings of over a half-million dollars, minus the cost of the backup power supply and the IT Ops RCA software with automated corrective action.
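The logic of such a device can be sketched in a few lines. The PowerMonitor class, its voltage figures, and its alert text are all hypothetical, invented for illustration; a real setup would rely on whatever interface the monitoring hardware and alerting tools actually expose.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PowerMonitor:
    """Hypothetical monitor that fails over to a backup supply on bad power."""

    nominal: float = 120.0       # assumed nominal voltage
    max_deviation: float = 0.05  # assumed threshold (5% of nominal)
    active_supply: str = "primary"
    alerts: list[str] = field(default_factory=list)

    def observe(self, volts: float) -> None:
        """Check one reading; switch supplies and alert IT Ops if it is out of range."""
        deviation = abs(volts - self.nominal) / self.nominal
        if deviation > self.max_deviation and self.active_supply == "primary":
            self.active_supply = "backup"
            self.alerts.append(
                f"Primary supply read {volts:.1f} V; switched to backup. "
                "Replace the primary supply ASAP."
            )


monitor = PowerMonitor()
for reading in [120.2, 119.7, 134.0, 136.5]:  # illustrative readings only
    monitor.observe(reading)

print(monitor.active_supply)
print(monitor.alerts)
```

The switch happens once, on the first bad reading, and the alert goes out at the same moment, so nobody has to be awake for the storage system to stay safe.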

While this seems like the ultimate in RCA and fault correction, most enterprises have not invested in this technology. RCA is still driven by humans, and humans correct the problems, typically through trial and error. The cost benefit of automated RCA is easy to see.

The benefits of RCA

As enterprises move to public clouds that have built-in RCA and redundancy, they are realizing the benefits of not having to track down each minor problem that ends with frustrated employees and customers.

To find the ultimate RCA for IT Ops that will best benefit the business, these best practices should be understood and applied in a stepwise order. The problem is scale. Enterprises have between 5,000 and 8,000 application workloads and associated data, and each of those applications may have a dozen or so components bound to it, all of which are potential root causes of problems.

It will take years for enterprises to walk down this path of doing better RCA, as well as automating everything that can be automated, with the objective of making most components self-correcting. The investment needs to start now. 
