Playing With Strings

November 20, 2009 by Ofek Shilon

Take the following code:

	CString str1("Startt"),
			str2("Start\0");
	str1.SetAt(str1.GetLength()-1, '\0');

	str1 += "End";
	str2 += "End";

What would you see when watching the resulting strings? Probably not what you expect:

This is a simplified version of a much dirtier, very real bug I dealt with recently. Several string and debugger features joined forces to cause this behaviour.

First – the debugger: it apparently watches CStrings as c-strings – displaying their essentially-LPTSTR member m_pszData.  Thus, any null embedded in the string (well, the first null, really) is treated as a terminating null – anything past it would not be displayed. When we force a watch on the full CString buffer, a fuller picture is revealed:

So the ‘End’ suffix was added to str1 after all – but why the difference between str1 and str2? How can initializing a string with an embedded null be any different than setting that null in the next line? The next clue is obtained by observing GetLength() for both strings. Note that GetLength returns the character count stored by the CString itself, not the strlen of the underlying c-string. (It is utterly unimaginable that such a basic behaviour goes undocumented.)


So, str1 and str2 are indeed somehow different before adding the ‘End’ suffix. In fact, they are fundamentally different even before manually setting the null in str1:

	CString str1("Startt"),
			str2("Start\0");

	int len1 = str1.GetLength(),	// gives 6
		len2 = str2.GetLength();	// gives 5 !

The issue now has nowhere left to hide. Stepping with the debugger into the CString ctors reveals the root cause: the constructor used for both CStrings accepts a char*-type as argument (in retrospect – how could it be otherwise?). So, just like in the debugger itself, the first embedded null is treated as a terminating null – anything past it would never make it into the CString. Try the following and see for yourself:

	CString str3("First\0Second"); // str3 now contains only "First" !
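
The same trap can be reproduced without MFC. The sketch below uses std::string as a portable stand-in for CString (an assumption made purely for illustration, and both helper names are made up): building from a bare char pointer stops at the first null, while passing an explicit length keeps everything.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a string the way a char*-taking constructor
// does, i.e. by scanning for the first null terminator.
std::string fromCString(const char* text) {
    assert(text != nullptr);
    return std::string(text);
}

// Hypothetical helper: builds a string from an explicit byte count, so
// embedded nulls survive.
std::string fromBuffer(const char* data, std::size_t length) {
    assert(data != nullptr);
    return std::string(data, length);
}
```

With these, fromCString("First\0Second") holds 5 characters, while fromBuffer("First\0Second", 12) holds all 12, null included. As far as I know CString has a comparable pointer-plus-length constructor, but check the documentation for your ATL/MFC version before relying on it.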

Once this root cause was understood, the bug was a half-line fix.

Thanks and kudos go to Alexander M. of wordpress support, who found and fixed within 1 hour (!) a wordpress bug that I reported, to make this post possible: until yesterday, wordpress would ignore explicit nulls (backslash + zero) between quotes, in a sourcecode section.

OptimizedMesh DirectX Sample Having Issues With Large Meshes

November 18, 2009 by Ofek Shilon

The DirectX SDK comes with quite a few nice samples, neatly organized in a sample browser. Quoting the documentation from the OptimizedMesh sample:

This OptimizedMesh Sample sample demonstrates the different types of meshes D3DX can load and optimize, as well as the different types of underlying primitives it can render. An optimized mesh has its vertices and faces reordered so that rendering performance can be improved.

Sadly, it turns out the code as is cannot load meshes with more than 64K vertices (much less optimize them). Now I’m sure somewhere in the SDK a disclaimer is buried, saying there’s no warranty, this isn’t production code, the usual yadda yadda. Still, it seemed to me that optimizing meshes is a topic of interest mostly to an audience dealing with large meshes (certainly I was), so this really deserves a fix.

The sample browser comes with neat ‘feedback’ links, and I did communicate this to MS a while ago. They never did get back to me, so I thought someone out there might benefit from the fix online.

In the main source file, OptimizedMesh.cpp, make the following addition:

...
// Load the mesh from the specified file
hr = D3DXLoadMeshFromX( strMesh, D3DXMESH_SYSTEMMEM, pd3dDevice,
      ppAdjacencyBuffer, &pD3DXMtrlBuffer, NULL,
      &g_dwNumMaterials, &pMeshSysMem );

if( FAILED( hr ) )
   goto End;

if(pMeshSysMem->GetOptions() & D3DXMESH_32BIT)
   g_dwMemoryOptions |= D3DXMESH_32BIT;

// Get the array of materials out of the returned buffer, and allocate a texture array
d3dxMaterials = (D3DXMATERIAL*) pD3DXMtrlBuffer->GetBufferPointer();
...

In a nutshell, the culprit is a tragic legacy of DirectX mesh files: by default, meshes allocate only 16 bits for a vertex index in the stored index buffer. Thus, meshes with more than 2^16 vertices require some explicit treatment – as listed here.
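
To see why 16 bits are not enough, here is a minimal, D3DX-free sketch. The helper names are hypothetical, and it ignores any index values an API might reserve for special purposes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when a mesh has more vertices than a 16-bit
// index can address (indices 0..65535).
bool needs32BitIndices(std::size_t vertexCount) {
    return vertexCount > 65536;
}

// Shows what happens to a vertex index that is squeezed into 16 bits:
// only the low 16 bits are kept.
std::uint16_t storeAs16Bit(std::uint32_t vertexIndex) {
    return static_cast<std::uint16_t>(vertexIndex);
}
```

For instance, storeAs16Bit(70000) yields 4464, so a triangle that should reference vertex 70000 silently references vertex 4464 instead.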

Coders at Work

November 5, 2009 by Ofek Shilon

I started reading Coders at Work, and it is just as good as Jeff and Joel say. The Jamie Zawinski chapter is brilliant. Brad Fitzpatrick  – while he may be an exceptional developer, he’s a ‘wow, like, dude!’  kind of speaker, and not much fun to read. The real highlight for me (so far) is Peter Norvig.

So far I’ve successfully avoided the temptation of rehashing stuff in this blog, but the Norvig interview is just too good. Every single paragraph in his interview is worth hanging as an office poster. (Plus, unlike the Zawinski interview, I haven’t seen it quoted around that much yet). Here are a few of his words, that are a real lesson to live by:

 

Seibel: How do you avoid over-generalization and building more than you need and consequently wasting resources that way?

Norvig: It’s a battle. There are lots of battles around that. And, I’m probably not the best person to ask because I still like having elegant solutions rather than practical solutions. So I have to sort of fight with myself and say, “In my day job I can’t afford to think that way.” I have to say, “We’re out here to provide the solution that makes the most sense and if there’s a perfect solution out there, probably we can’t afford to do it.” We have to give up on that and say, “We’re just going to do what’s the most important now.” And I have to instill that upon myself and on the people I work with. There’s some saying in German about the perfect being the enemy of the good; I forget exactly where it comes from – every practical engineer has to learn that lesson.

Seibel: Why is it so tempting to solve a problem we don’t really have?

Norvig: You want to be clever and you want closure; you want to complete something and move on to something else. I think people are built to only handle a certain amount of stuff and you want to say, “This is completely done; I can put it out of my mind and then I can go on.” But you have to calculate, well, what’s the return on investment for solving it completely? [My emph - OS] There’s always this sort of S-shaped curve and by the time you get up to 80 or 90 percent completion, you’re starting to get diminishing returns. There are 100 other things you could be doing that are just at the bottom of the curve where you get much better returns. And at some point you have to say, “Enough is enough, let’s stop and go do something where we get a better return.”

Editing Binary Resources with VS

October 16, 2009 by Ofek Shilon

The need occasionally arises to modify binary resources without re-compilation. Say you want to change the manifest-dependencies of a dll you don’t have the source to.  Or you wish to bump up the version of an executable without actually working on it, exactly as, ahem, a good friend of mine sometimes does.

A quick search will get you tons of free and commercial dedicated tools for the task.  I accidentally learnt that you already have such a tool. It’s called Visual Studio.

Just open your binary (exe, dll, ocx etc.) as a regular file from the menu (you can’t drag-n-drop it in). All the file resources are there on the screen, for you to abuse.

Duplicate Volume Serial Numbers

October 14, 2009 by Ofek Shilon

We recently released a product version, with yearly licenses attached to the machine’s Volume Serial Number. Now it is called a ‘serial number’, and it seems as meaningless and as random as a UID (mine is 34EE-10A0), so it must be a UID. Right?

Well, not quite. This ID characterizes a volume, not a disk. If you have a partitioned disk, just type at a command prompt ‘dir c:’ and ‘dir d:’ (or whatever) and watch your partitions’ different VSNs. As the link teaches, the VSN data is part of the partition’s extended boot sector, and is no more than a hash of the partition-creation date & time (i.e., disk formatting date & time). So, it’s not technically unique – if any two disks are formatted (or partitions created) at the exact same time, they’d have identical VSNs. Also – since it’s only 4 bytes, the chances of a random hash-duplication are very real. Just for sport, if it’s evenly distributed and the world has, say, 1 billion computers, the chance of a duplicate-free distribution of VSNs is around 0.187^(1 billion). So there are in fact quite a few duplicate VSNs out there. But hey – unless you’re Microsoft, such global-scale stuff really shouldn’t trouble you. I mean, c’mon – say you have – what, 1000 clients? 10,000? Make it a hundred-thousand clients. You should never worry about the chance of a duplicate VSN. Now should you?
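
For a rough feel of the purely random part of that question, the standard birthday approximation can be sketched as below. This is an illustration only: it assumes VSNs are independent and uniformly distributed over all 2^32 values, and the function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of `clients` machines share a
// serial number, ASSUMING serial numbers are independent and uniformly
// distributed over `possibleValues` (the usual birthday approximation).
double duplicateChance(double clients, double possibleValues) {
    assert(clients >= 0.0 && possibleValues > 0.0);
    const double pairs = clients * (clients - 1.0) / 2.0;
    return -std::expm1(-pairs / possibleValues);  // 1 - exp(-pairs / N)
}
```

Under those assumptions, duplicateChance(1000, 4294967296.0) comes out at roughly 0.01%, 10,000 clients at roughly 1.2%, and 100,000 clients at roughly 69%.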

The real and sad answer, as I recently discovered, is that if you have two clients who use an identical computer model (at least by Dell, but probably true for all other major vendors), the chance of them having identical VSN is exactly ONE.

Dell do not separately format and install every hard drive of the kajillion they deploy. They make some master copy, then deep-copy it around (as we home users do with Acronis, Norton Ghost or whatever). As noted, the VSN is part of the data on the disk, and so is copied as well.

We tried to confirm this officially with Dell, so far without success. The issue has very sparse web presence too, hence – this post. Hope it helps someone.

Memory Fragmentation Trouble

October 6, 2009 by Ofek Shilon

We recently had some weird issues that turned out to emanate from a failure to allocate a large consecutive chunk of heap memory.  (It was an exceptional pain to nail the cause there – maybe more on that in a future post).  The desired allocation was to be  ~400M, and since machines today ship more-or-less-by-default with 2G-4G RAM, there shouldn’t be a real justification for such allocations to fail.  Or should there?

First of all, regardless of your available physical RAM, your real memory playground size is 2G – the bottom half of your process’ address space, its user-mode portion. Yes, I’m well aware of the /3GB boot.ini switch, and trust me – you don’t want to go there in a 3D application. I was badly burnt there already. PAE/AWE have downright hostile API sets too – you’d just have to make do with 2G.

The real issue here is memory fragmentation.

An obvious solution would be migrating to Win64, and forgetting about fragmentation issues for the near century. Sadly, this was not a feasible option for us: we have a legacy stash of in-house 32-bit custom hardware drivers, and migrating those would be the absolute last resort.

Happily, a  surprisingly short online research gave quite a few constructive 32-bit directions. Here are some.

  1. Low Fragmentation Heap is a nice built-in feature, on by default since Vista. You should apply LFH to the CRT heap, retrieved by _get_heap_handle (just try the sample code). Even better – try applying it to all process heaps. There should be no reason not to apply this to all projects, except (screeeeeeeeeeeeech..) it seems the magic doesn’t work on standard debug builds. Which, well, err, makes it kinda useless.
  2. HeapDecommitFreeBlockThreshold is a magical registry key that is advertised to make a noticeable difference. It does so by causing the heap to hold on to small allocations just a bit longer. Such increase of the HeapManager jurisdiction can potentially prevent page ‘theft’ for non-heap usage, thereby reducing some fragmentation factors.
  3. Typically a lot of fragmentation (at the 100Megs scale) is caused by sparse mapping of binary images to the process address space, at load time.
    In simpler English, say your process uses forty 1-Meg dlls, and maps them to memory in regular 50Meg intervals. They now sparsely occupy just 40Megs of your available 2G, leaving no consecutive memory chunk larger than 49M!
    To counter that, first map your virtual address usage. Until recently you’d have to use either vadump or direct code instrumentation, but since this summer you have the incredible (as always) SysInternals tool VMMap. When you spot some dll’s that are just teasingly smiling at you from the middle of your address space, use editbin.exe to ruthlessly rebase them away.
  4. Pre-designate a large heap (say 500M) at link time, thus giving the heap a head start in the race for consecutive pages.
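
The arithmetic behind item 3 can be checked with a small, OS-independent sketch. Everything here is a simplification for illustration: sizes are in megabytes, the 2G address space is rounded to 2000 MB so the numbers come out even, and the type and function names are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// A mapped region as (base, size), both in megabytes for readability.
using Region = std::pair<std::uint64_t, std::uint64_t>;

// Returns the largest unmapped gap inside [0, spaceSize), assuming the
// regions do not overlap and all lie within the address space.
std::uint64_t largestFreeGap(std::vector<Region> regions, std::uint64_t spaceSize) {
    std::sort(regions.begin(), regions.end());
    std::uint64_t largest = 0;
    std::uint64_t cursor = 0;  // end of the previous mapped region
    for (const Region& r : regions) {
        assert(r.first >= cursor && r.first + r.second <= spaceSize);
        largest = std::max(largest, r.first - cursor);
        cursor = r.first + r.second;
    }
    return std::max(largest, spaceSize - cursor);
}

// Builds `count` regions of `size` MB each, spaced `stride` MB apart,
// with the first one based at `firstBase`.
std::vector<Region> evenlySpaced(std::uint64_t count, std::uint64_t size,
                                 std::uint64_t stride, std::uint64_t firstBase) {
    std::vector<Region> regions;
    for (std::uint64_t i = 0; i < count; ++i) {
        regions.push_back({firstBase + i * stride, size});
    }
    return regions;
}
```

With forty 1 MB images scattered every 50 MB, largestFreeGap(evenlySpaced(40, 1, 50, 0), 2000) is 49. Packing the same forty images back to back at the top of the space, evenlySpaced(40, 1, 1, 1960), leaves a single 1960 MB gap, which is the whole point of rebasing.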

I decided to try the steps in order of increasing effort, and am overjoyed to say (2) & (4) sufficed. We now successfully allocate 400M chunks.

We did peek into the process with VMMap, though, and it did surface some interesting finds. For one, Babylon translator, installed on all our development machines, has the HUTZPA to inject captlib.dll into the very middle of our precious address space.

My hunch says rebasing could indeed hold the highest impact. We may have to try that too eventually – I hope to post with some findings.

Debug/Release Numerical Differences

August 28, 2009 by Ofek Shilon

Below’s an (almost) exact copy of an internal paper I distributed a while ago. All names and places, and even code snippets, are actually very real..

We recently had trouble reproducing in debug builds the exact numerical behavior of release builds. This just happens occasionally, and we’re used to accepting it as some compiler black magic. This time we dug a bit deeper – I’m pretty confident we understood all root causes, and the conclusions might be useful for everyone.

There are three ways in which we explicitly cause such differences ourselves.

1.    Different code paths

Someone included the following two code pieces in math3d.h. First one:


VECTOR& VECTOR::Normalize()
{
#if defined( _DEBUG )
   // Debug builds: delegate to D3DX
   D3DXVec3Normalize((D3DXVECTOR3*)this, (D3DXVECTOR3*)this);
#else
   // Release builds: approximate 1/sqrt(norm2) with SSE,
   // masked to 0 unless norm2 is greater than EPSILON
   float norm2 = x * x + y * y + z * z;
   __m128 val = _mm_load_ss( &norm2 );
   _mm_store_ss( &norm2,
       _mm_and_ps(
           _mm_cmplt_ss( _mm_set_ss( EPSILON ), val ),
           _mm_rsqrt_ss( val ) ) );

   x *= norm2;
   y *= norm2;
   z *= norm2;
#endif

   return *this;
}

Not only are the execution results numerically different, the _DEBUG path consistently runs 30% faster. The 2nd weird piece of code is VECTOR::LengthRcpApprox, and is even weirder (both paths are equally unreadable, no consistent performance difference). The file history doesn’t go back far enough to blame someone, but of course we all know it’s Amnon. Anyway – I left only a single code path in both functions (the ones measured faster).

2.    /arch

Release builds were compiled with /arch:SSE2, (project properties-> Configuration properties -> C/C++ -> Code Generation-> Enable enhanced instruction set) while debug builds weren’t.

This means release builds were aware of SSE instructions, and mostly favored them over FPU instructions. Even for scalar computations, SSE is usually faster – but by default it differs in computational precision (roughly, SSE: 32 bit, FPU: 80 bit).  This was mitigated with a call to

3.    Floating point model

Release builds were compiled with /fp:fast, and debug builds with /fp:precise. (Project properties-> Configuration properties -> C/C++ -> Code Generation-> Floating point model).

That’s a heavier issue, and in some sense the root cause. A thorough survey is here.

In a nutshell – the C++ standard is strict about order of operations, and about points of intermediate rounding (from register precision to final stack precision). For example, an expression like ‘a+b+c’ must be evaluated as ‘(a+b)+c’.   a+(b+c) is usually different – that’s the disease of working with finite precision floats:  to see why, try  a= 1.0,  b = c = 2^(-24)  (other examples at the link).
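
The a = 1.0, b = c = 2^(-24) example can be tried directly. The sketch below assumes IEEE-754 single precision with the default round-to-nearest mode, and that the compiler is not told to relax floating-point rules; the function names are made up.

```cpp
#include <cmath>

// volatile forces each intermediate result to be rounded to float and keeps
// the compiler from folding or reordering the additions.
float sumLeftFirst(float a, float b, float c) {
    volatile float partial = a + b;   // (a + b)
    volatile float total = partial + c;
    return total;
}

float sumRightFirst(float a, float b, float c) {
    volatile float partial = b + c;   // (b + c)
    volatile float total = a + partial;
    return total;
}

// 2^(-24): half the spacing between 1.0f and the next float above it.
float tinyTerm() {
    return std::ldexp(1.0f, -24);
}
```

With a = 1.0f and b = c = tinyTerm(), sumLeftFirst returns exactly 1.0f, because each tiny term is rounded away on its own. sumRightFirst first adds the two tiny terms into 2^(-23), which is large enough to survive, and returns the next float above 1.0f.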

When you compile with /fp:fast, you explicitly say “dear compiler, I don’t care about this nonsense. When you see stuff like ‘a+b+c’, assume I don’t mind how you evaluate it – if I did, I’d put parentheses”. The compiler is free to make many similar optimization choices – a comprehensive list is at the link.

When we set /fp:precise to release builds, (after the previous 2 fixes), all remaining numerical differences vanish (well, at least all those I tested).  However, in some use scenarios I tested the performance penalty is tangible.

For now we set /fp:fast to all configurations (including debug) in all projects. Sadly, some numerical differences remain (the compiler does make different choices in optimized and non-optimized builds), but they seem 2 orders of magnitude smaller.

Bottom Line

There is now a much, much better chance of reproducing release-behaviour in debug builds. If there’s a behaviour you’re still unable to reproduce, try compiling both release & debug with /fp:precise and reproducing. This is the only official way of producing consistent results, and unless more brilliant “#ifdef _DEBUG”s are hiding in your code path – will do the trick.

Debugging Memory Leaks, Part 2: CRT support

June 19, 2009 by Ofek Shilon

This feature is well documented, yet from what I see it doesn’t get the usage it deserves. Here’s a quick, beginner-oriented rehash – if only to refer my teammates.

Problem and Immediate Solution

If you’re developing MFC apps, the way you’ll usually notice any leaks is by terminating your app and seeing the following in the output window:

Detected memory leaks!
Dumping objects ->
C:\myfile.cpp(20): {130} normal block at 0x00780E80, 64 bytes long.
Data: <      > CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD
Object dump complete.

The {130} is emphasized on purpose – it is the serial number of the leaking allocation. What if you could count allocation occurrences, and break in exactly the leaky one? You could not only get a complete stack trace for the leak, but even step and debug it!

Well, it might not be that easy.  If the allocation serial number is not consistent across multiple runs, it means the leaking memory is allocated in a threaded code portion.  In such cases, you’re probably better off resorting to other methods.

Now if the serial number is consistent, what allocations exactly does it count? In what API do you break? Must you really set a breakpoint with a ‘hit count’ there? If the number is high, it could easily get prohibitively slow.

Happily, there is an easy solution to the second set of worries. Early in your code, call -

_CrtSetBreakAlloc(130)
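
For a non-MFC app, a minimal sketch of that early call might look as follows. The wrapper name and the kHasDebugCrt constant are made up for illustration. The CRT calls are only compiled under the MSVC debug CRT, and elsewhere the function does nothing and says so.

```cpp
#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
constexpr bool kHasDebugCrt = true;
#else
constexpr bool kHasDebugCrt = false;
#endif

// Hypothetical wrapper: arms the CRT allocation breakpoint when built against
// the MSVC debug CRT, and reports whether it did anything.
bool armLeakBreakpoint(long allocationNumber) {
#if defined(_MSC_VER) && defined(_DEBUG)
    // Ask the debug heap to track allocations and dump leaks at exit.
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
    // Break into the debugger when the allocation with this serial number happens.
    _CrtSetBreakAlloc(allocationNumber);
    return true;
#else
    (void)allocationNumber;  // the debug heap is unavailable in this build
    return false;
#endif
}
```

Call armLeakBreakpoint(130) as early as possible in main, before allocation number 130 has a chance to happen.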

Mid-depth Dive

The _CrtSetBreakAlloc trick is an iceberg-tip of some heavier CRT machinery. It is entirely operated for you if you use MFC, but you can use it yourself in non-MFC apps (although you should watch out for some issues).

The dump that starts it all is the output of _CrtDumpMemoryLeaks.  You can call it directly in your app – but if you’re down to a binary search for the leak , you  might prefer _CrtMemDumpAllObjectsSince for more fine-grained control of the allocations you dump. Maybe more on that in a future post.

The allocation counter resides in the undocumented _heap_alloc_dbg, which channels calls from all C allocators – namely new, malloc, and their brethren (_malloc_dbg, aligned/offset flavours, etc.). This is a CRT apparatus, so naturally it won’t catch direct OS allocations (HeapAlloc, VirtualAlloc et al.), not to mention COM allocators (SysAllocString, CoTaskMemAlloc etc.).

When you do set an allocation breakpoint, you’d (naturally) break in the counting function itself, _heap_alloc_dbg.  Your own code probably resides a good 4-5 stack levels below – go there and debug away.

Bonus

You can do all the above at runtime, from the debugger. The documentation for VS2005 says you can put -

{,,msvcr71d.dll}_crtBreakAlloc

in the watch window, and modify its contents directly (say, to 130). However, the documentation is wrong in 3 places.

  1. For VS2005, the correct dll version is msvcr80d.dll. This seems to have been fixed for the VS2008 and VS2010 pages.
  2. When using the context operator, you must use decorated symbol names – which amounts to adding another underscore.
  3. The value is an int*, and for some reason my debugger fails to deduce it itself.

All in all, just type at the watch window -

(int*){,,msvcr80d.dll}__crtBreakAlloc

And debug happily ever after.


Viewing Debuggee Environment Variables

June 17, 2009 by Ofek Shilon

Koby asks - “How suitable is VS automation for doing something like printing the values of all environment variables in the debuggee process”?

In fact, I just learnt how to do that.

Type ‘$env=0’, either at a watch window or an immediate window.

Doesn’t get much easier than that. And unlike what the commenter reports, it works like magic for me in VS2005 too.

Breaking on System Functions with the Context Operator

May 30, 2009 by Ofek Shilon

The context operator is technically documented, although barely so. Seemed to me the reason is that it’s mostly broken – but turns out you can in fact get some value out of it.

The documented syntax template is:

{[function],[source],[module] } location

(it is one of 3 different templates, but that’s the one that’s useful to me). Apparently these names have different meaning in different contexts.

Broken, Official Syntax

Here ‘function’ seems to mean function name, ‘source’ means source file, ‘location’ means line number in source file. Some of the documented examples prefix it with ‘@’, others with ‘.’. (edit: this is explicitly acknowledged elsewhere, but never explained).  From brief experimenting, however, none of these are working and this feature seems all but completely broken.

{,MySource.cpp,}@20

can get you to break at random locations in the file, or refuse to parse altogether.

{MyFunc, MyApp.cpp,}@3

May actually work. May also set the break location at a different line in the function. If the specified line is out of the function range, it may either set a break at a seemingly random source location or fail to parse.

etc. etc.  Documentation hasn’t changed between VS2003 and VS2010, and it seems these issues aren’t going anywhere anytime soon.

Working, Not-Really-Official Syntax

Turns out if you interpret ‘location’ as a function name, you may actually get some work done.  This is mentioned as a single-line example in the VC6 documentation, and also sporadically on the web.  For example:

{,,}MyFunc

Actually works.

With slight modifications (that Gregg does not mention), this can be used to break into functions with no source code: (a) include the module (i.e., exe/dll name) and (b) use decorated function names (this is said to be redundant since VS2008).  Needless to say, if you intend to break in MS functions (Win32, CRT or other), you’d need to obtain public MS symbols – maybe more on that in a future post.

For example,

{,,}OutputDebugStringW

would fail to parse,

{,,}_OutputDebugStringW@4

would parse successfully but will not set a break, and finally -

{,,kernel32.dll}_OutputDebugStringW@4

would get the job done.
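
The decoration itself is mechanical for 32-bit __stdcall C functions: a leading underscore, the plain name, then ‘@’ and the total size of the arguments in bytes. A small sketch (both helper names are made up, and it does not cover C++ mangling or other calling conventions):

```cpp
#include <cstddef>
#include <string>

// Decorated name of a 32-bit __stdcall C function: a leading underscore,
// the plain name, then '@' and the total size of its arguments in bytes.
std::string decorateStdcall(const std::string& name, std::size_t argumentBytes) {
    return "_" + name + "@" + std::to_string(argumentBytes);
}

// Hypothetical helper: wraps a symbol in the {function,source,module} context
// operator with only the module filled in.
std::string breakpointExpression(const std::string& module, const std::string& symbol) {
    return "{,," + module + "}" + symbol;
}
```

OutputDebugStringW takes a single pointer, 4 bytes on x86, so breakpointExpression("kernel32.dll", decorateStdcall("OutputDebugStringW", 4)) gives back the working expression above.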

Getting decorated names can be much easier than stated before, since most interesting MS functions are C functions: the decorated name (and module) can be viewed in the call-stack window, by stepping into the function (in disassembly):

[Screenshot: call-stack window showing the decorated function name and module]
