‘Frames below may be incorrect’, or: Stack Walking Requires Symbols

Here’s the symptom – you stop and inspect a stack:

[Screenshot: stack_dlg1_cut]

Note the message at the second line:

Frames below may be incorrect and/or missing. No symbols loaded for XXXXX.dll.

Chances are you read it once years ago and ignored it ever since.  If so, that’s a shame – because the debugger is dead serious about it.

You load some missing symbols (either from the modules window, or by right clicking a line on the stack window), and the stack changes. Often the code the debugger displayed (for the topmost frame where code is available) turns out to be misleading – that frame is nowhere to be found on the updated stack!

[Screenshot: stack_Dlg2_cut]

This happens a lot, e.g. when uncaught exceptions are thrown outside your own code, when you pause execution, or when switching to a different thread while stopped at a breakpoint.  Essentially – it happens whenever stack walking has to start from an optimized module without loaded symbols (in particular, MS modules like ntdll above).

First Corollary

Loading MS-symbols is NOT optional!

Hopefully every dev in the civilized world knows what these are and how to get them (MS made it considerably easier since VS2008), so I will not rehash. What is less widely known is that without them there’s a good chance you’re looking at wrong stacks.

Second Corollary

Apparently stack walking is harder than it seems, and depends in some way on debug information.

That was news to me, and called for some research.

Naive Stack Walking

[A full x86-stack-layout tutorial is an undertaking way beyond the extent of my spare time. Try e.g. here for a nice read. The following is a very rough and minimal description]

Ignoring most stack content (function parameters, local variables, calling conventions, exception handling, buffer security and such), a naive layout of an x86 stack frame is something like -

Memory address:   Stack elements:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 0x104FD8   |  | Parameters                 | \
            |  +----------------------------+  |
 0x104FD4   |  | Return address, routine 2  |  |
            |  +----------------------------+   >  Stack frame 3
 0x104FD0   +--| EBP value for routine 2    |  |
               +----------------------------+  |
 0x104FCC   +->| Local data                 | /  <-- Routine 3's EBP
            |  +----------------------------+
 0x104FC8   |  | Return address, routine 3  | \
            |  +----------------------------+  |
 0x104FC4   +--| EBP value for routine 3    |  |
               +----------------------------+   >  Stack frame 4
 0x104FC0      | Local data                 |  | <-- Current EBP
               +----------------------------+  |
 0x104FBC      | Local data                 | /
               +----------------------------+
 0x104FB8      |                            |    <-- Current ESP
                \/\/\/\/\/\/\/\/\/\/\/\/\/\/

Here the EBP (Extended Base Pointer) slots store saved values of the register that marks the start of a stack frame. So obtaining a stack trace should be as simple as -

  1. Walking the chain of EBPs (each EBP slot on a stack frame points to just below the EBP slot of its parent frame) to form the stack of frames.
  2. From the slot just above each EBP slot, taking the stored return address (the caller's EIP) and using it to identify the calling module and the calling function name.
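
As a rough illustration of the two steps above, here is a toy Python model, not real debugger code. It assumes the usual `push ebp; mov ebp, esp` prologue, so the saved EBP sits at [EBP] and the return address at [EBP+4] (the diagram above draws EBP one slot lower). The `stack` contents, the addresses and the name `walk_ebp_chain` are all made up for the example.

```python
# Toy model of a 32-bit stack: address -> 4-byte value.
# All addresses and values here are invented for illustration.
SAVED_EBP_OFFSET = 0  # assumption: [EBP]   holds the caller's EBP
RETURN_OFFSET = 4     # assumption: [EBP+4] holds the return address


def walk_ebp_chain(memory, ebp, max_frames=64):
    """Follow saved-EBP links; return a list of (frame_ebp, return_address)."""
    frames = []
    while ebp != 0 and len(frames) < max_frames:
        saved_slot = ebp + SAVED_EBP_OFFSET
        return_slot = ebp + RETURN_OFFSET
        if saved_slot not in memory or return_slot not in memory:
            break  # the chain leads outside the known stack: give up
        frames.append((ebp, memory[return_slot]))
        next_ebp = memory[saved_slot]
        if next_ebp != 0 and next_ebp <= ebp:
            break  # the stack grows down, so a parent frame must lie higher
        ebp = next_ebp
    return frames


stack = {
    0x104FC0: 0x104FD0, 0x104FC4: 0x00401230,  # innermost frame
    0x104FD0: 0x104FE0, 0x104FD4: 0x00401100,  # its caller
    0x104FE0: 0x0,      0x104FE4: 0x00401020,  # outermost frame, chain ends
}

for frame_ebp, return_address in walk_ebp_chain(stack, 0x104FC0):
    print(f"EBP={frame_ebp:#x}  return address={return_address:#x}")
```

Note that the walk trusts every saved-EBP slot blindly. If a single function along the way did not save EBP in the expected slot, everything from that frame down would be garbage or missing, which is the kind of failure the debugger's warning describes.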

Step (2) is indeed best done with symbols, but can be achieved with some success (exported functions only) using only linker map files. Either way, this description cannot explain why the partition of the stack into frames should itself depend on the presence of symbols! There has to be more to the story.

First Exception – FPO

Ever since the Intel 386, both ESP and EBP can serve as reference points for local stack variables, and faced with the scarcity of registers on x86 systems it was a tempting optimization to drop the dedicated use of EBP altogether. This is called Frame Pointer Omission, and it does indeed require dedicated stack-walking assistance info in the PDB – as the traditional EBP chain breaks completely once a single stack frame uses FPO. However, FPO has rarely been used in practice and has been completely disabled in MS builds since Vista, so it cannot possibly account for all occurrences of bad stack traces.

Other Exceptions?

Well, probably – but not documented ones. I have several reasons to believe so:

(1) StackFrameTypeEnum (used by IDiaStackFrame) indeed includes FrameTypeStandard and FrameTypeFPO, but also other frame types (notably FrameTypeFrameData).

(2) The VC team blog did hint that a PDB includes, quote,

the unwind program to execute to walk to the next frame

(3) I’ve witnessed these symptoms on dumps taken from Win7 machines – and AFAIK Win7 binaries were compiled without FPO.

(4) The venerable John Robbins answered an email of mine, saying:

Yes, unless you have all symbols loaded for a native application, you can never truly trust the stacks. You’re right that MSFT no longer uses FPO, but if the symbols for a DLL in the stack are not loaded, the StackWalk64 API, which all debuggers use, goes through heuristics to walk the stack. Heuristics is a fancy name for guessing. :) For example, if you don’t have the PDB files loaded for a module, the stack walking code will show you the closest symbol to an address. That symbol could actually come from another DLL. Once you load the PDB file, the correct symbol is available so the address symbol will change in the call stack window.
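
The "closest symbol to an address" part of that answer is easy to picture with a small sketch. This is only a guess at the general idea, not how StackWalk64 or any debugger actually does it; the symbol tables, the addresses and the helper `nearest_symbol` are hypothetical.

```python
import bisect


def nearest_symbol(symbols, address):
    """Name an address after the closest symbol at or below it.

    symbols: list of (start_address, name), sorted by start_address.
    Returns 'name+0xoffset', or None if no symbol starts at or below address.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return f"{name}+{address - start:#x}"


# Hypothetical module: without a PDB only the two exports are known;
# with the PDB an internal function between them becomes visible too.
exports_only = [(0x1000, "ExportedA"), (0x3000, "ExportedB")]
with_pdb = [(0x1000, "ExportedA"), (0x2000, "InternalHelper"), (0x3000, "ExportedB")]

return_address = 0x2040
print(nearest_symbol(exports_only, return_address))  # ExportedA+0x1040
print(nearest_symbol(with_pdb, return_address))      # InternalHelper+0x40
```

The same address gets a different name once the fuller symbol table is available, which matches the way frames get renamed in the call stack window after symbols load.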

Can’t say I really got to the root of this dependence on PDBs, though. If anyone out there cares to shed more light over this, I’d love to hear!

Every Time You Try to Micro-Reuse Code, God Kills a Kitten.

- where by ‘macro-reuse’ I mean reusing objects or libraries (which is good), and by ‘micro-reuse’ I mean seeing a class that has some overlap with your needs and saying ‘hey, let’s derive from that and just override these two interfaces’, or seeing a function that looks similar to the one you were about to code and saying ‘hey, let’s call that and add an internal branch for my new case’ (which kills kittens).

Code reuse seems to hold the promise for software heaven. Forcing it into a paragraph, the mantra says: identify common functionality, code it at a single location, then use it all over the place. When a fix is needed, just apply it at this single location – and have your entire app enjoy it at no additional cost.

The world of software design seems to consider this an unquestionable truism[*]. Alas, it does not hold in (at least my subset of) reality.

I have virtually never seen code whose lifecycle followed this sterile path: code and maintain at a single place, enjoy everywhere. When some functionality is indeed made to service multiple clients, the likely real-life scenario is that no one will ever, ever, dare to maintain it.

The cases where a product bug is identified that is truly internal to some well defined common functionality, regardless of its users, are rare at best. Much more often, the process is -

  1. The needs of a specific code-client change, or a behavioral bug is detected in a specific context.
  2. The poor soul who is tasked with the fix cannot possibly know of all other code clients and thus dares not touch the common functionality.
  3. The specific bug or requirements change is addressed with a patch at the client code.
  4. The common functionality quickly turns into a code fossil, never to be modified again.

Thus blind code reuse, intended to increase maintainability, ends up creating tight coupling – thereby vastly reducing maintainability.

I see this all the time, both in twisted flow control and overload of flow-control flags, and in ridiculous class hierarchies. I recently had to deal with ~6 layers of ~30 classes (!) aimed solely at reusing existing code. Occasionally as few as 10 lines were reused from a parent, but apparently someone was very literal (and very thorough) in interpreting the code-reuse mantra. The result is, of course, a huge spaghetti bowl that no one dares touch – one would always prefer to override rather than risk regression. Such code is also unmaintainable at another level: the functionality of a single concrete object is spread across as many as 6 classes! Just following the action paths or the content of various containers (modified throughout the hierarchy layers) was a formidable task.

______________________

[*] Only after writing this up did I find a software heavyweight that holds similar opinions. Others have also articulately phrased similar objections specifically to reuse via inheritance.