Deliberate null pointer dereferences in the Linux Kernel

Posted by Andy

A recent debate on the Linux kernel mailing list about some Coverity defect reports highlights some interesting questions about how to interpret static analysis results.

Here's the patch in question:

---
xfs: fix possible NULL dereference in xlog_verify_iclog

In xlog_verify_iclog a debug check of the incore log buffers prints an
error if icptr is null and then goes on to dereference the pointer
regardless.  Convert this to an assert so that the intention is clear.
This was reported by Coverty.

Reported-by: Geyslan G. Bem 
Signed-off-by: Ben Myers 
---
 fs/xfs/xfs_log.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)
Index: b/fs/xfs/xfs_log.c
===================================================================
--- a/fs/xfs/xfs_log.c  2013-10-23 14:52:47.875216875 -0500
+++ b/fs/xfs/xfs_log.c  2013-10-23 14:53:53.775245830 -0500
@@ -3714,11 +3714,9 @@ xlog_verify_iclog(
    /* check validity of iclog pointers */
    spin_lock(&log->l_icloglock);
    icptr = log->l_iclog;
-   for (i=0; i < log->l_iclog_bufs; i++) {
-       if (icptr == NULL)
-           xfs_emerg(log->l_mp, "%s: invalid ptr", __func__);
-       icptr = icptr->ic_next;
-   }
+   for (i=0; i < log->l_iclog_bufs; i++, icptr = icptr->ic_next)
+           ASSERT(icptr);
+
    if (icptr != log->l_iclog)
        xfs_emerg(log->l_mp, "%s: corrupt iclog ring", __func__);
    spin_unlock(&log->l_icloglock);
--

You can read the debate on LKML. The checker that detects this defect attempts to identify code that follows this pattern:

if(p == NULL)
    ...
*p

where the body of the if-statement does not reassign p to a non-null value. The algorithm the checker uses is much more advanced than a simple pattern match, so it can deal with variations such as passing p into a function which subsequently dereferences it arbitrarily deep in the call stack. On its face, this kind of code presents a contradiction. If the pointer can actually be null, then the dereference causes the code to crash. If the pointer can never be null, then the test if(p == NULL) is unnecessary. Either way, the code signals disagreement with itself about the potential values of p.
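
As a rough sketch of that interprocedural case (struct buf, get_len, and report are invented names, not code from the kernel), the checker would still flag the dereference even though it happens inside a callee:

#include <stdio.h>

struct buf { int len; };

static int get_len(struct buf *b)
{
    return b->len;               /* the dereference happens here, one call deep */
}

int report(struct buf *b)
{
    if (b == NULL)
        printf("null buffer\n"); /* tests for NULL but falls through... */
    return get_len(b);           /* ...then passes b to a function that dereferences it */
}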

An unnecessary test is usually a lot less harmful than a null pointer dereference. But for our purposes, we will count either as a defect. One motivation for removing unnecessary null tests is to enhance the precision of the analysis. For example, removing an unnecessary null test of a function parameter tells the analysis the function unambiguously expects the parameter to be non-null. This expectation is automatically propagated to callsites and then enforced. The analysis can then detect more serious defects caused by the callers of the function passing in null values.
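
To make that concrete, here is a hypothetical sketch of the propagation (item_value and first_value are invented functions, not kernel code). Once the redundant test is gone, the analysis infers that item_value() requires a non-null argument and can flag callers that break that expectation:

#include <stddef.h>

struct item { int value; };

/* With no redundant "if (it == NULL)" test, the analysis infers that
 * item_value() unambiguously expects a non-null argument. */
int item_value(struct item *it)
{
    return it->value;
}

int first_value(struct item **items, size_t n)
{
    struct item *it = (n > 0) ? items[0] : NULL;
    return item_value(it);       /* flagged: it can be NULL when n == 0 */
}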

Clarity of intent also helps other human developers to make sense of the code. When collaborating with other developers to maintain code, clarity and consistency can be a great aid in avoiding misunderstanding and speeding up code comprehension.

But we're realists. Not all code is going to be perfectly clear, and the mantra of "it works" is deeply ingrained in the collective developer psyche. Beyond that, there are legitimate disagreements about what constitutes clarity. That's why we tune the checker carefully by identifying the idioms developers follow when writing real-world code. By not reporting code that matches these idioms, we can reduce the noise level in the results, leaving mostly real defects. Sampling across hundreds of millions of lines of code shows that this checker has a low false positive rate - less than 10% for most code bases.

A simple example of an idiom is a killpath, a function that terminates execution. Common killpath functions include abort() and panic(). There are also synthetic killpath functions that extend this concept, for example by logging a message and then calling a primitive killpath function.
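
A synthetic killpath might look roughly like this (fatal() is an invented example, not a kernel or Coverity function):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* A synthetic killpath: logs a message, then calls the primitive
 * killpath abort(), so execution never continues past it. */
static void fatal(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    abort();
}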

Taking the example above, if we fill in the blank with a killpath function, the contradiction in the code suddenly disappears:

if(p == NULL)
    abort();
*p

Now, the path where p is null does not continue past abort(), so there is never a null dereference. You might ask: why bother with the abort if the pointer dereference will crash anyway? If you replace the abort with a call to a logging function, the program terminates on the null pointer dereference instead of in abort() - a marginal difference - and you get a log entry to help debug the problem. This is precisely the debate on LKML.
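
In code, the logging variant looks something like this (use() is an invented example):

#include <stdio.h>

static void use(int *p)
{
    if (p == NULL)
        fprintf(stderr, "use: p is NULL\n"); /* logs, but execution continues */
    printf("%d\n", *p);                      /* the crash now happens here, not in abort() */
}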

We recognize that these situations occur. When they do, there are several options, including:

  1. Triage the defect as Intentional or False Positive. By marking the defect in the Coverity database, you won't be bothered by it again. This requires a trusted process for triage and review; otherwise someone could unilaterally ignore an important defect.

  2. Modeling. It might be worthwhile to tell the analysis that xfs_emerg is a killpath function. I am no XFS developer, but my reading of the code suggests that xfs_emerg is a logging function that signals an "XFS emergency". If so, then a crash shortly after a call to such a function is expected. It is easy to tell the analysis to treat xfs_emerg as a killpath (see the sketch after this list), so defects whose path traces go past it won't be reported.

  3. Components. Perhaps defects found in the file xfs_log.c are uninteresting because the code in question only runs in debug builds. A component can be created to segregate this file and temporarily hide the defects in it. This technique is often useful to prevent test code, third-party code, and other lower-priority areas from drowning out defects in more important areas.

  4. Checker options. The analysis has hundreds of options that affect the behavior of its algorithms. Some of these options turn on detection of certain idioms that we determined are not suitable as defaults. They might work perfectly well for a specific code base; only experimentation will tell. In this case there don't appear to be any that affect this defect report.

  5. Fix the issue. Sometimes, it's just better to change the code. If it's confusing for the analyzer, it's potentially confusing to other human developers too. But if the fix is made solely to "shut the analysis up" then it's likely to be a missed opportunity. Often, examining the code nearby can be worthwhile as well. Not taking this opportunity is like not fully washing your hands after going to the bathroom. It's unhygienic and it's a false economy.
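
For option 2, a user model could look roughly like the sketch below. It assumes Coverity's __coverity_panic__ modeling primitive and approximates the signature of xfs_emerg, so treat it as an illustration rather than a drop-in model.

/* Sketch of a Coverity user model marking xfs_emerg as a killpath.
 * The signature is approximated; __coverity_panic__ is assumed to be
 * available in the model compilation environment. */
struct xfs_mount;

void xfs_emerg(const struct xfs_mount *mp, const char *fmt, ...)
{
    __coverity_panic__();    /* model the call as never returning */
}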

The bigger picture is that the decision about what to do with static analysis findings is not a purely technical one. If developers feel the analysis is forcing them to do a lot of "extra work" that yields little benefit, they might ignore it entirely. Gaining the trust and adoption of developers is more important than winning a technical debate on any particular bug, or covering any particular defect type. Ultimately it's about what standards developers want to hold themselves accountable to. The analysis is just an algorithm doing its job, trying to enforce some measure of consistency against that standard.
