A Day with VS11 Beta – part 2.5: Auto Vectorizer, done right

Start at the end: the main example analyzed in the previous post is plain wrong. This loop:

for (int i=0; i<1000; ++i)   sum += a[i];

Vectorizes perfectly.

Even after me wrongfully accusing his team with this fictitious vectorization miss, Jim Hogg was kind enough to (1) test it and report this reduction loop is indeed vectorized, (2) link to my post, and worse yet, (3) say he enjoyed this blog.   What can I say, I’m embarrassed and humbled.   Thanks Jim.

My mistake was not – as Jim suspected – omitting /fp:fast. Rather, the problem was I coded multiple simple tests into a single console app main function, and debugged the resulting binaries from ICC/MSVC in disassembly mode.   From a more thorough inspection it seems both ICC and MSVC now do an aggressive interleaving of computations, and if as I suspect the aging PDB format still maps a consecutive range of instruction addresses to each source line – the debugger has a hard time matching location in disassembly to a source line. All in all, most probably I pulled the right conclusions on the wrong loops.

I did similar tests again – this time checking a single loop in every test.  A different case quickly turned up where ICC vectorizes and MSVC doesn’t:

double  a[2] = { 1., 2.};
double b[20000];
double S = 0;
for(int i=0; i<20000; i+=2)
S += a[0]*b[i] + a[1]*b[i+1] ;

And just to make extra sure, here’s some disassembly:

MSVC:

image

ICC:

image

ICC does some loop unrolling too so the code is harder to follow – but for skimming purposes it suffices to note the ‘packed double’ mul version (mulpd) in ICC, contrasted with the ‘scalar double’ mul version (mulsd) in MSVC. Similar results are seen in single precision floats too.

As in the previous post, this is simplified code that aims to capture the essence of real vectorizable scenarios. Suppose, for example, you need to transform a 3D mesh by a fixed rotation and translation. This amounts to a large loop with computations of the above type: one argument constant, the other scanning an array.  Such code might benefit considerably from auto vectorization.

The real test was the last one to be described at the blog: build and measure some real life computationally intensive code.  I did just that, and the results were – as noted – no measurable improvement over VC10.  So either my code has less to benefit from vectorization than I hoped, or – the gaps remaining in the vectorizer hold more promise than the gaps already filled.

I gotta try and measure performance with ICC one day – if I’ll ever have the patience. Our code builds for nearly half an hour on MSVC, so I’m guessing ICC builds would have to be done neither nightly or over-weekendly.

A Day with VS11 Beta – part 2: Auto Vectorizer

UPDATE: While I still believe the overall conclusions below hold, the actual analysis of the main example is erroneous and kept here only out of respect for some external links.  A detailed correction is in the following post.  Thanks @JimHogg!


Vectorization is inherently a low level topic, so when discussing it there is a tendency to go deep into technical details – thereby losing both a large part of the audience and the wider perspective. I’ll try to avoid these pitfalls here – so no disassembly screenshots today.

Background

For over 15 years one of the main evolution directions for processors has been SIMD: the application of an operation in a single cycle (more or less) to multiple data elements, stored consecutively in a wide register. Initial incarnations were limited to integer types and so not very useful, but when floating point support had started – developer interest caught up.

The bread-and-butter of current vectorization, at least on desktop platforms, is SSE: an ever expanding brand of x86 instructions and registers introduced by Intel. SSE registers are 128-bit wide and every SSE instruction modifies such a register – thereby processing either 4 single-precision or two double-precision floats in a single cycle.  Prior to SSE, MMX instructions operated on 64-bit wide registers, the post-SSE AVX instructions work on 256, rumor has it that the next core-2 generation would maintain 512b registers – you get the idea.

Of course someone needs to write code to execute all these fancy instructions. In most of these 15 years this burden laid on the developer – and to a large extent it still does. Either by .asm files, inline assembly or intrinsics – some low level maven has to code at instruction level to utilize all this extra processing power. Optimized libraries do help with some common math and image processing tasks, but to a large extent mainstream code has yet to benefit anything from vectorization. This may change if compilers become smart enough to recognize vectorizable portions of high-level code, and to do the vectorization themselves.

Auto-Vectorization is the coveted ability to do just that: have the compiler do the vectorization for you.

VS11 is Microsoft’s first step towards this noble goal. I read that the dev-preview was less than stellar, but haven’t seen anything about the beta. I care deeply for this particular feature and had high hopes for it, and so set out to explore.

Results

As the VC compiler guys demonstrate in this Channel 9 session, ‘embarrassingly parallel’ code vectorizes well:

float a[1000], b[1000], c[1000];
for(int i=0; i<1000; ++i)  a[i] = b[i] + c[i];

Reduction poses an extra challenge, as it introduces a semantic dependence among loop iterations – but the following still vectorizes well:

float sum = 0.0f;
for(int i=0; i<1000; ++i) sum += a[i];

As far as I my brief exploration went, the optimizer sees easily through object-oriented abstractions. In header-only container templates I also saw many cross-function vectorizations (namely, where inlining took place).

But.

A Big But

The party stops here:

for(int i=0; i<1000; ++i)    sum += a[i]*b[i];

This code does not vectorize in VS11 beta.

Mind you, this is not a contrived example – it is a fundamental computation type in BLAS, 3D and pretty much every quazi-mathematical code: vector-vector dot product, matrix-vector product, matrix-matrix product – are all essentially such sums of products.

Now vectorization plainly is hard, and this loop in particular.  Data dependencies that prevent auto vectorization (and parallelization in general) are -

  • Read after write,
  • Write after read,
  • write after write.

And the sum-of-products loop above seems to contain all three .  However:

  1. The same holds for the direct sum reduction loop:   for(…)  sum += a[i], that VC11 vectorizes successfully.
  2. It just so happens that the Intel compiler documentation uses this very code as example for vectorization of reduction loops:

…there is a very important exception, that apparently contains all of the above types of dependency:

sum=0;
for (j=1; j<MAX; j++) sum = sum + A[j]*B[j]
Although sum is both read and written in every iteration, the [Intel] compiler recognizes such reduction idioms, and is able to vectorize them safely.

Competition

Optimization-wise the Intel compiler is really where the bar is set. Not only does it successfully vectorize a much wider range of cases, it supplies extensive pragmas and switches to control vectorization and can even report which loops were vectorized, and to some extent – why vectorization failed where it did. (In all the tests above I inspected disassembly to tell which loops vectorized in MSVC).

The lack of all this functionality is very much acceptable in beta – and who knows, maybe some of it is actually there, details slowly unveil in Jim Hogg’s series of posts. I hope even more would be exposed in the RTM – but I would be (pleasantly) surprised if the range of vectorizable loops would be broadened.

Intel’s compiler is markedly slower and it might seem that MS deliberately makes different performance/productivity tradeoffs, but now that it’s obvious MS does want to get vectorization done – it’s also obvious that they’re simply playing catch-up in this field.

Bottom Lines

The code of most products I work on is a classic example of case where auto-vectorizer should be beneficial: the tasks are highly intensive computationally, the code is entirely C++ with almost no low-level optimizations, it is also highly branch intensive and it is hard to find single bottlenecks where manual optimizations would justify the investment.

I built and run one of our core products in VC11 beta and measured a ~10 minutes computational job.  I saw no performance improvement over VC10 (VC11 actually took a bit longer, but well within statistical error).

As of May 2012, devs still must hand-optimize the code to rip the benefits of ~1995 processor improvements. Perhaps you’d be the one to change that?