PGO and OpenMP

Alan Ning just posted about a cool hack of VS’s PGO switch, which was a sad reminder of all the wonderful goodies I’d never taste in my present job.

PGO builds can incorporate many optimizations that regular builds cannot, thereby leading to ~7.5% speedup (as measured by the VC++ team) with *zero* coding effort.  Still, eagerly trying it – I got struck with "Fatal Error C1310: profile guided optimizations are not available with OpenMP".

You see occasional rants about this incompatibility, but these are met with consistent MS reluctance to address it. Quoting from the Connect bug:

Since PGO and OpenMP are both designed to increase performance, it is a shame that they can’t be used together, but there were some major design issues preventing this. There is an ordering dependency in the designs of PGO and OpenMP. OpenMP changes some things about the program state that are important to PGO, and it does this after PGO has already processed the program state and made many decisions based on it.

And further yet -

We will be considering a modification of the designs of PGO and/or OpenMP to get this to work in a future product version.

That was written in April 2005. 3 VC++ versions later, the incompatibility persists.

We are using OpenMP, extensively, as do probably all dev shops today who care about performance. Seems anyone who does any new native code today probably cares deeply about performance – and those are exactly the devs that both OpenMP and PGO are aimed at!  The stated obstacles in addressing this issue (essentially, getting the order of operations right) don’t sound that insurmountable to me. I’d be surprised if other compilers suffer similar limitations (but frankly I have no idea. Anyone has better info?).

As crescens2k answered me some time ago,

The way Microsoft generally do things is by demand. So if you think this is something which should get fixed soon, the best thing you can do is submit a suggestion and get everyone you know to support it.

Well said.  Please, take 20 seconds and upvote it.

Ternary Operator and Type Compatibility

The (?,:) operator, as in -

int a = (b>5)? 6 : 7 ;

is commonly referred to as the Ternary Operator. (It is a common abuse of terminology: a ternary operator is one that operates on three arguments. This particular ternary operator is the Conditional Expression operator.)  It has several surprising attributes.

First – did you know that it can result in an lvalue? This is perfectly valid c++ code:

int a, b, c = 5;
((c>4) ? a : b ) = 42;

I can imagine some scenarios where this could make code more succinct,  but for the most part (quoting my friend Liron Leist here) this is just the kind of code that makes compilers scratch their heads and go open a c++ book.

Here’s a second gotcha that bit me a while ago, and only recently did I feel I actually understand it. Lo the following:

class Pet   {};
class Dog : public Pet {};
class Cat : public Pet  {};

Pet* pPet = NULL;

// This compiles perfectly:
if( IsSocialGuy() )
    pPet = new Dog;
else
    pPet = new Cat;

// While this doesn't:
pPet = IsSocialGuy() ? new Dog : new Cat;

Kinda surprising. The compiler feedback even more so:

error C2446: ‘:’ : no conversion from ‘Cat *’ to ‘Dog *’

Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

Huh? cast Cat* to Dog*? Who ordered that? MSDN isn’t much help either:

If both expressions are of pointer types (…) , pointer conversions are performed to convert them to a common type.

Ok, what? Seems both of the expression alternatives needs to be evaluated as expressions – why must they result in identical types?  Even worse: suppose you can accept this as a c++ quirk – isn’t it painfully obvious what casting has to be done to convert both Cat* and Dog* to a common type (namely, Pet*)?

Well – thank the universe for Eric Lippert – no, it is not obvious one bit. The code continues:

void PutInMicrowave(Dog* pPet)
{
    cout << _T("Woof?  Splat.");
}

void PutInMicrowave(Cat* pPet)
{
    cout << _T("Meaw?  Splat.");
}

// How would you suggest compiling this call?
PutInMicrowave( IsThai() ? new Cat : new Dog);

If the types of the two expression alternatives do not coincide, the compiler faces the risk of being unable to resolve overloads. As Eric beautifully put it, ‘We need to be able to work out the type of an expression without knowing what it is being assigned to. Type information flows out of an expression, not into an expression.’

Elsewhere, Eric lays out the details of the logic behind ‘pointer conversions are performed to convert them to a common type’. This is, however, in C# context (as was his StackOverflow reply) – I can only speculate that the C++ semantic logic is similar, but can’t find any authoritative links (suggestions are welcome!)

On _purecall and the Overhead(s) of Virtual Functions

A while ago a friend asked me whether pure virtual functions have higher overhead than regular virtual functions. At the time I answered that this cannot be – the pure/non-pure distinction is meaningful only at compile time, and non-existent at runtime. Only a (long) while later did I connect the dots, and understand what he meant than (sorry Mordy..)

Regular Overhead

Virtual calls are known to be more costly than calls that are resolved at compile time.  Elan Ruskin measured ~50% difference – I measured a bit less, but the difference is certainly there. For functions that do real work the actual call overhead can be mostly neglected, but for functions that get called a lot in a scenario you’re struggling to optimize – you can get tangible results by eliminating virtual calls. It’s widely considered a good practice to use the added flexibility of virtual functions when you have a concrete reason and not just for the fun of it.

There are two reasons for the added call cost – main one being the CPU instruction-prefetch mechanism.  A regular call is resolved at compile time into something like -

call 0xabcd1234

- which the instruction caching apparatus (a.k.a trace cache) easily resolves ahead of time, and can tell where to continue the fetching from. However, a virtual call is compiled into something like -

call eax

Faced with the impossible task of predicting the contents of eax dozens of instructions in advance, the trace cache just stalls.

The second potential extra cost is an extra dereferencing. The first DWORD_PTR of a c++ object (neglecting virtual inheritance) is a pointer to it’s virtual table –  a table that is common to all instances of the same class. A call to a virtual function is resolved by first dereferencing that vtable pointer, and only then calling the function at a fixed offset from the vtable start. Maciej Sinilo tried to isolate this cost by comparing calls via explicit function pointers to calls via virtual functions, and turns out the difference is practically non-measurable. (BTW, I didn’t check but I suspect part of the reason is the compiler’s ability to resolve that dereference in compile time, in many situations).

Pure-Virtual Extra Overhead??

Well, not really. At least not directly. But the function ‘containers’ – classes that are used as interfaces – do come with some extra weight.

To put it short, an interface class by default has its own constructor and destructor – just like any other class. These are called just before/after a child class constructor/destructor – as you would expect in any hierarchy.

But wait… what? Why? If you didn’t provide an implementation for such a constructor,  what can it possibly do?

What every compiler-generated constructor does: set up the class vtable pointer. In this case, it is a very short lived pointer – one that is immediately overwritten by the child ctor.

Take the following code:

class A
{
public:
	__declspec(noinline) A()
		{ _tprintf_s(_T(" A::A "));}
	virtual ~A()
		{ _tprintf_s(_T(" A::~A "));}

	virtual	void	f() = 0;
	virtual	void	g() = 0;
};

class B : public A
{
public:
	__declspec(noinline) B()
		{ _tprintf_s(_T(" B::B "));}
	virtual ~B()
		{ _tprintf_s(_T(" B::~B "));}

	virtual	void	f()
		{ _tprintf_s(_T(" B::f "));}
	virtual	void	g()
		{ _tprintf_s(_T(" B::g "));}
};

int _tmain(int argc, _TCHAR* argv[])
{
	B b;
	return 0;
}

The order of events in B’s construction is as follows:

(1) The child ctor, B::B() is called and immediatly calls the parent ctor A::A().

(2) The parent ctor A::A(), setd the object vfptr to point to the common A vtable -

~A
xxx
xxx

(keep those xxx placeholders in mind – more on them soon). Watching the object state in VS at this point you see:

ACtor

- emphasized are the instruction that populates the object vfptr, and the referenced vtable itself.

(3) B::B() continues, and modifies the object vfptr to point to the common B vtable:

~B
B::f()
B::g()

and again a VS view:

BCtor

If you’d call f() from B::B() A::A() [thanks Roman!], which is the sort of self-foot-shots c++ allows,  you’d be using A’s vtable (which is the only one the object knows at that time), end up calling the nonexistent xxx and gloriously crash in runtime. Of course it’s never that direct, and it’s pretty much a consensus you should never call a class virtuals from its own ctor/dtor.

Why all that hassle??

Frankly, I don’t know. It seems C++ goes out of its way to build and tear apart abstract-parent vtables, than exposes them briefly only during child construction/destruction, and then the community expert unanimously recommend never to use them. Herb Sutter does give 3 scenarios where you might consider using them – I find none of them convincing, and generally consider this to be one of the C++ semantic warts.

So, can this extra weight be mitigated?

Yes – at least in MS-specific ways. The direct way is by adding a __declspec(novtable) modifier to the abstract interface declaration. If you can guarantee that the interface class would never need any constructors/destructors (which can be tricky at times), it would be more readable to use __interface instead.

Beyond the direct saving of the extra ctor/dtor work, a happy side effect of novtable is that it eliminates all references to the modified interface vtable. The linker is then able to remove it from the binary altogether – thereby reducing the binary size and providing some extra boost. (When applied to a lot of interfaces, this can get tangible results!)

Bonus – so what is xxx, really?

Google xxx and see for yourself. I’ll wait until you return. It’d probably be a while.

Ok – this probably deserved it’s own post, as there still appears to be some online confusion regarding it. Apparently the ‘= 0′ pure virtual syntax is leading some to believe that the xxx entries are truly zeros. In fact MSDN columnist Paul DiLascia wrote sometime in 2000 that -

…the compiler still generates a vtable all of whose entries are NULL and still generates code to initialize the vtable in the constructor or destructor for A.

That may actually have been true than (I’m not even sure of that), but certainly isn’t now.

xxx is the address of the CRT function _purecall, which is essentially a debugging hook. You can  control xxx’s value directly by overloading purecall yourself, or alternatively use _set_purecall_handler to route into your own handler from within _purecall. You might consider doing so, e.g., to collect stack traces or minidumps in production code.