Eric Lippert Dissects CVE-2014-6332, a 19-year-old Microsoft bug

Posted by Eric

Today's Coverity Security Research Lab blog post is from guest blogger Eric Lippert.

[UPDATE 1: The MISSING_RESTORE checker regrettably doesn't find the defect in the code I've posted here. Its heuristic for avoiding false positives causes it to suppress reporting, ironically enough. We're working on tweaking that heuristic for an upcoming release.]

It was with a bizarre combination of nostalgia and horror that I read this morning about a 19-year-old rather severe security hole in Windows. Nostalgia because every bit of the exploited code is very familiar to me: working on the portion of the VBScript engine used to exploit the defect was one of my first jobs at Microsoft back in the mid-1990s. And horror because this is really a quite serious defect that has been present probably since Windows 3.1, [Update 2: I've heard that Windows 3.1 is in fact not affected, so you IE 2-5 users are safe ;)] and definitely exploitable since Windows 95. Fortunately we have no evidence that this exploit has actually been used to do harm to users, and Microsoft has released a patch. (Part of my horror was the fear that maybe this one was my bad, but it looks like the actual bug predates my time at Microsoft. Whew!)

The thirty-thousand-foot view is the old familiar story. An attacker who wishes to run arbitrary code on a user's machine lures the user into browsing to a web page that contains some hostile script -- VBScript, in this case. The hostile script is running inside a "sandbox" which is supposed to ensure that it only does "safe" operations, but the script attempts to force a particular buggy code path through the underlying operating system code. If it does so successfully, it produces a corrupt data structure in memory which can then be further manipulated by the script. By cleverly controlling the contents of the corrupted data structure, the hostile script can read or write memory and execute code of the attacker's choice.

Today I want to expand a bit on Robert Freeman's writeup, linked above, to describe the underlying bug in more detail, the pattern that likely produced it, better ways to write the code, and whether static analysis tools could find this bug. I'm not going to delve into the specifics of how this initially-harmless-looking bug can be exploited by attackers.

What's so safe about a SAFEARRAY?

Many of the data structures familiar to COM programmers today, like VARIANT, BSTR and SAFEARRAY, were created for "OLE Automation"; old-timers will of course remember that OLE stood for "object linking and embedding", the "paste this Excel spreadsheet into that Word document" feature. OLE Automation was the engine that enabled Word and Excel objects to be accessed programmatically by Visual Basic. (In fact the B in BSTR stands for "Basic".) Naturally, Visual Basic uses these data structures for its representations of strings and arrays. The data structure which particularly concerns us today is SAFEARRAY:

typedef struct tagSAFEARRAY 
{
  USHORT         cDims;         // number of dimensions
  USHORT         fFeatures;     // type of elements
  ULONG          cbElements;    // byte size per element
  ULONG          cLocks;        // lock count
  PVOID          pvData;        // data buffer
  SAFEARRAYBOUND rgsabound[1];  // bounds, one per dimension
} SAFEARRAY;

typedef struct tagSAFEARRAYBOUND
{
  ULONG cElements; // number of indices in this dimension
  LONG  lLbound;   // lowest valid index
} SAFEARRAYBOUND;

SAFEARRAYs are so-called because unlike an array in C or C++, a SAFEARRAY inherently knows the dimensionality of the array, the type of the data in the array, the number of bytes in the buffer, and finally, the bounds on each dimension. How multi-dimensional arrays and arrays of unusual types are handled is irrelevant to our discussion today, so let's assume that the array involved in the attack is a single-dimensional array of VARIANT.
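To make that layout concrete, here is a small self-contained sketch using stand-in definitions of my own (these mimic, but are not, the actual Windows SDK declarations): for a single-dimensional array, the total buffer size is just the per-element byte size multiplied by the element count.

```cpp
#include <cstdint>

// Illustrative stand-ins for the OLE Automation structures shown above;
// not the real Windows SDK definitions.
struct SAFEARRAYBOUND {
    uint32_t cElements; // number of indices in this dimension
    int32_t  lLbound;   // lowest valid index
};

struct SAFEARRAY {
    uint16_t       cDims;         // number of dimensions
    uint16_t       fFeatures;     // type of elements
    uint32_t       cbElements;    // byte size per element
    uint32_t       cLocks;        // lock count
    void*          pvData;        // data buffer
    SAFEARRAYBOUND rgsabound[1];  // bounds, one per dimension
};

// For a single-dimensional array the buffer size is simply
// element size times element count.
uint32_t TotalBufferSize(const SAFEARRAY* psa) {
    return psa->cbElements * psa->rgsabound[0].cElements;
}
```

So a 1000-element array of 16-byte VARIANTs needs a 16000-byte buffer, and note that the bounds live in the same header block as the rest of the metadata -- which is why corrupting a bound corrupts the array's notion of its own size.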

The operating system method which contained the bug was SafeArrayRedim, which takes an existing array and a new set of bounds for the least significant dimension -- though again, for our purposes, we'll assume that there is only one dimension. The function header is:

HRESULT SafeArrayRedim(
  SAFEARRAY      *psa,
  SAFEARRAYBOUND *psaboundNew
)

Now, we do not have the source code of this method, but based on the description of the exploit we can guess that it looks something like the code below that I made up just now.

Bits of code that are not particularly germane to the defect I will omit, and I'll assume that somehow the standard OLE memory allocator has been obtained. Of course there are many cases that must be considered here -- such as "what if the lock count is nonzero?" -- that I am going to ignore in pursuit of understanding the relevant bug today.

As you're reading the code, see if you can spot the defect:

{
  // Omitted: verify that the arguments are valid; produce
  // E_INVALIDARG or other error if they are not.

  PVOID pResourcesToCleanUp = NULL; // We'll need this later.
  HRESULT hr = S_OK;

  // How many bytes do we need in the buffer for the original array?
  // and for the new array?

  LONG cbOriginalSize = SomehowComputeTotalSizeOfOriginalArray(psa);
  LONG cbNewSize = SomehowComputeTotalSizeOfNewArray(psa, psaboundNew);
  LONG cbDifference = cbNewSize - cbOriginalSize;

  if (cbDifference == 0)
  {
    goto DONE;
  }

  SAFEARRAYBOUND originalBound = psa->rgsabound[0];
  psa->rgsabound[0] = *psaboundNew;
  // continues below ...

Things are looking pretty reasonable so far. Now we get to the tricky bit.

Why is it so hard to shrink an array?

If the array is being made smaller, the variants that are going to be dropped on the floor might contain resources that need to be cleaned up. For example, if we have an array of 1000 variants containing strings, and we reallocate that to only 300, those 700 strings need to be freed. Or, if instead of strings they are COM objects, they need to have their reference counts decreased.

But now we are faced with a serious problem. We cannot clean up the resources after the reallocation. If the reallocation succeeds then we no longer have any legal way to access the memory that we need to scan for resources to free; that memory could be shredded, or worse, it could be reallocated to another block on another thread and filled in with anything. You simply cannot touch memory after you've freed it. But we cannot clean up resources before the reallocation either, because what if the reallocation fails? It is rare for a reallocation that shrinks a block to fail. While the documentation for IMalloc::Realloc doesn't call out that it can fail when shrinking a block (a documentation bug, perhaps?), it doesn't rule it out either. If the reallocation does fail, we have to return the original array, untouched, and deallocating 70% of the strings in the array is definitely not "untouched".

The solution to this impasse is to allocate a new block and copy the resources into that new block before the reallocation. After a successful reallocation we can clean up the resources; after a failed reallocation we of course do not.

  // ... continued from above
  if (cbDifference < 0)
  {
    pResourcesToCleanUp = pmalloc->Alloc(-cbDifference);
    if (pResourcesToCleanUp == NULL)
    {
      hr = E_OUTOFMEMORY;
      goto DONE;
    }
    // Omitted: memcpy the resources to pResourcesToCleanUp
  }

  PVOID pNewData = pmalloc->Realloc(psa->pvData, cbNewSize);
  if (pNewData == NULL)
  {
    psa->rgsabound[0] = originalBound;
    hr = E_OUTOFMEMORY;
    goto DONE;
  }
  psa->pvData = pNewData;

  if (cbDifference < 0)
  {
    // Omitted: clean up the resources in pResourcesToCleanUp
  }
  else
  {
    // Omitted: initialize the new array slots to zero
  }

  hr = S_OK; // Success!

DONE:

  // Don't forget to free that extra block.
  if (pResourcesToCleanUp != NULL)
    pmalloc->Free(pResourcesToCleanUp);
  return hr;
}

Did you spot the defect?

Part of the contract of this method is that when this method returns a failure code, the original array is unchanged. The contract is violated in the code path where the array is being shrunk and the allocation of pResourcesToCleanUp fails. In that case we return a failure code, but never restore the state of the bounds which were mutated earlier to the smaller values. Compare this code path to the code path where the reallocation fails, and you'll see that the restoration line is missing.
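To make the fix concrete, here is a deliberately simplified, self-contained sketch of just the shrink path. ShrinkBound, Bound, and the g_alloc hook are all names I've invented for illustration; the allocator hook stands in for IMalloc so that a test can force the allocation to fail, exactly the way an attacker forces an out-of-memory condition at the wrong moment.

```cpp
#include <cstdlib>
#include <cstdint>
#include <cstddef>

struct Bound { uint32_t cElements; int32_t lLbound; };

// Hypothetical allocator hook so a test can force allocation failure.
void* (*g_alloc)(size_t) = std::malloc;

// Test hook: an allocator that always fails.
void* FailingAlloc(size_t) { return nullptr; }

// Sketch of the corrected shrink path: if the scratch allocation for
// the resources-to-clean-up block fails, the mutated bound is restored
// before returning the error, preserving the contract that failure
// leaves the array untouched. (Assume 16-byte elements.)
bool ShrinkBound(Bound* pBound, const Bound* pNew, void** ppScratch) {
    Bound original = *pBound;
    *pBound = *pNew;  // mutate early, as the real code apparently did
    size_t cbFreed =
        (size_t)(original.cElements - pNew->cElements) * 16;
    *ppScratch = g_alloc(cbFreed);
    if (*ppScratch == nullptr) {
        *pBound = original;  // the missing line: undo the mutation
        return false;        // E_OUTOFMEMORY in the real code
    }
    return true;
}
```

With the restoration in place, a forced allocation failure leaves the caller's bound exactly as it was, and the inconsistent-array state the exploit depends on never comes into existence.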

In a world where there is no hostile code running on your machine, this is not a serious bug. What's the worst that can happen? In the incredibly rare case where you are shrinking an array by an amount bigger than the memory you have available in the process, you end up with a SAFEARRAY that has the wrong bounds in a program that just produced a reallocation error anyway, and any resources that were in that memory are never freed. Not a big deal. This is the world in which OLE Automation was written: a world where people did not accidentally download hostile code off the Internet and run it automatically.

But in our world this bug is a serious problem! An attacker can make what used to be an incredibly rare situation -- running out of virtual address space at exactly the wrong time -- quite common by carefully controlling how much memory is allocated at any one time by the script. An attacker can cause the script engine to ignore the reallocation error and keep on processing the now-internally-inconsistent array. And once we have an inconsistent data structure in memory, the attacker can use other sophisticated techniques to take advantage of this corrupt data structure to read and write memory that they have no business reading and writing. Like I said before, I'm not going to go into the exact details of the further exploits that take advantage of this bug; today I'm interested in the bug itself. See the linked article for some thoughts on the exploit.

How can we avoid this defect? How can we detect it?

It is surprisingly easy to write these sorts of bugs in COM code. What can you do to avoid this problem? I wrote who knows how many thousands of lines of COM code in my early days at Microsoft, and I avoided these problems by application of a strict discipline. Among my many rules for myself were:

  • Every method has exactly one exit point.
  • Every local variable is initialized to a sensible value or NULL.
  • Every non-NULL local variable is cleaned up at the exit point.
  • Conversely, if the resource is cleaned up early on a path, or if its ownership is ever transferred elsewhere, then the local is set back to NULL.
  • Methods which modify memory locations owned by their callers do so only at the exit point, and only when the method is about to return a success code.
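To illustrate that discipline, here is a made-up method -- every name in it is invented -- that follows all five rules: one exit point, every local initialized, cleanup at the exit point, the local set back to NULL on ownership transfer, and caller-owned memory written only immediately before returning success.

```cpp
#include <cstdlib>
#include <cstring>
#include <cstddef>

// Hypothetical COM-style method: builds a buffer filled with a byte
// value and transfers ownership to the caller. 0 stands in for S_OK
// and -1 for a failure HRESULT.
int MakeFilledBuffer(size_t cb, unsigned char fill, void** ppOut) {
    int hr = 0;                  // one result variable, one exit point
    void* pBuffer = NULL;        // every local initialized

    if (ppOut == NULL) { hr = -1; goto DONE; }

    pBuffer = malloc(cb);
    if (pBuffer == NULL) { hr = -1; goto DONE; }
    memset(pBuffer, fill, cb);

    // Caller-owned memory is modified only here, immediately before
    // returning success; ownership transfers, so the local goes back
    // to NULL so the cleanup below does not free it.
    *ppOut = pBuffer;
    pBuffer = NULL;

DONE:
    if (pBuffer != NULL)         // non-NULL locals cleaned up here
        free(pBuffer);
    return hr;
}
```

Note that on every failure path the caller's `*ppOut` is never touched, so there is no state to remember to restore -- which is precisely the property the buggy shrink path lacked.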

The code which I've presented here today -- which I want to emphasize again I made up myself just now to illustrate what the original bug probably looks like -- follows some of these best practices, but not all of them. There is one exit point. Every local is initialized. One of the resources -- the pResourcesToCleanUp block -- is cleaned up correctly at the exit point. But the last rule is violated: memory owned by the caller is modified early, rather than immediately before returning success. The requirement that the developer always remember to restore the caller's data to its original state in the event of an error is a bug waiting to happen, and in this case, it did happen.

Clearly the code I presented today does not follow my best practices for writing good COM methods. Is there a more general pattern to this defect? A closely related defect pattern that I see quite often in C, C++, C# and Java is:

someLocal = someExternal;              // save the external state
someExternal = differentValue;         // mutate it "temporarily"
DoSomethingThatDependsOnTheExternal();
//... lots of code ...
if (someError) return;                 // oops: the restoration below is skipped
//... lots of code ...
someExternal = someLocal;              // restore the external state

And of course the variation where the restoration of the external value is skipped because of an unhandled exception is common in C++, C# and Java.
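In C++, one way to make the restoration survive both early returns and exceptions is a small RAII guard that restores the saved value on scope exit unless it is explicitly dismissed on the success path. This RestoreOnExit type is a sketch of my own, not a standard library facility:

```cpp
// A minimal restore-on-scope-exit guard. The destructor puts the
// saved value back unless dismiss() was called, so no error return
// or exception can skip the restoration.
template <typename T>
class RestoreOnExit {
    T&   ref_;
    T    saved_;
    bool active_;
public:
    explicit RestoreOnExit(T& ref)
        : ref_(ref), saved_(ref), active_(true) {}
    RestoreOnExit(const RestoreOnExit&) = delete;
    RestoreOnExit& operator=(const RestoreOnExit&) = delete;
    void dismiss() { active_ = false; }  // call on the success path
    ~RestoreOnExit() { if (active_) ref_ = saved_; }
};
```

With a guard like this, the MISSING_RESTORE pattern above becomes structurally impossible to get wrong: the mutation and its undo are tied together at the point of mutation, and only a deliberate dismiss() on the success path makes the change stick.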

Could a static analyzer help find defects like this? Certainly; Coverity's MISSING_RESTORE analyzer finds defects of the form I've just described. (Though I have not yet had a chance to run the code I presented today through it to see what happens.)

There are a lot of challenges in designing analyzers to find the defect I presented today; one is determining that in this code the missing restoration is a defect on the error path but correct on the success path. This real-world defect is a good inspiration for some avenues for further research in this area; have you seen similar defects that follow this pattern in real-world code, in any language? I'd love to see your examples; please leave a comment if you have one.
