How to build a better string class - and a surprise

2005-01-16

I seldom use the C++ string classes supplied with the standard libraries, out of a belief that (for my purposes) that I can easily do better, building on bits of history and experience. Considerations that go into building a simple and highly efficient C++ string class include the following.

The standard <string.h> supplied C string functions are highly optimized (as described in a previous post).
Heap operations are best avoided (where practical) in long running applications. Use of a free list is very effective when the application has a cyclic workload (as most do).
Use of C++ inline class methods allows both readable code and most effective use of the optimizer.

In particular the Microsoft C/C++ compiler will emit x86 string instructions for common string operations. The x86 string operations are substantially faster than the equivalent sequence of generic x86 instructions. In fact I used x86 string instructions as the chief trick to write a fast GUI framework for the original 4.77Mhz 8088 original IBM PC (this is pre-Windows by quite a bit).

At least I knew this all was once true. Checking your assumptions is always a good idea. Last time I checked this particular assumption the i386 and i486 were current generation CPUs. Certainly the current generation of x86 CPUs are considerably changed.

Turns out I was in for a surprise. The source of the test program is at the end. First the results.

C:\tmp\benchmarks>bin\debug\string TIME old strings 5568 ticks for 1180000000 characters written RATE old strings 211.9 MB/second (1180.0 MB in 5.6 seconds)

C:\tmp\benchmarks>bin\release\string TIME old strings 8462 ticks for 1180000000 characters written RATE old strings 139.4 MB/second (1180.0 MB in 8.5 seconds)

Yikes! The non-optimized "debug" code is faster than the optimized "release" code?! Not exactly what I was expecting.

This above results are from compiling with the Microsoft Visual C++ 6.0 compiler.

After some experimentation I found that turning off inlined intrinsic functions (which in the case of string functions emits x86 string instructions) erased essentially all of the performance difference. (Note the pragma in the final sources found below).

C:\tmp\benchmarks>bin\debug\string TIME old strings 5859 ticks for 1180000000 characters written RATE old strings 201.4 MB/second (1180.0 MB in 5.9 seconds)

C:\tmp\benchmarks>bin\release\string TIME old strings 5738 ticks for 1180000000 characters written RATE old strings 205.6 MB/second (1180.0 MB in 5.7 seconds)

While the above results were a surprise, in a way they make sense. Current CPUs throw an insane amount of hardware at the problem of making a single instruction stream run as fast as possible. Likely the instruction traces showed the x86 string instructions to be rarely used. If so the CPU designers would quite logically choose to optimize the commonly used instructions, and not the string ops -- thus explaining the results above.

Note that my timings were all done on AMD Athlon CPUs (all I have on my home network). The results may be different for Intel Pentium CPUs -- and are almost certainly different between older and newer models. Comparative times between debug/release builds for each Pentium generation would be interesting, and might tell us when the CPU designers made the suspected tradeoff. (My guess would be that the string operations started to lose ground after the first generation Pentiums - but that is only a guess).

This deserves a bit more exploration. Tried compiling the same sources with the GNU C++ compiler.

C:\tmp\benchmarks>g++ -o string-gcc string\string.cpp

C:\tmp\benchmarks>string-gcc TIME old strings 7371 ticks for 1180000000 characters written RATE old strings 160.1 MB/second (1180.0 MB in 7.4 seconds)

C:\tmp\benchmarks>g++ -O6 -o string-gcc string\string.cpp

C:\tmp\benchmarks>string-gcc TIME old strings 7381 ticks for 1180000000 characters written RATE old strings 159.9 MB/second (1180.0 MB in 7.4 seconds)

In this case un-optimized code again beats the optimized code by a small amount. The GNU compiler times are not as good as the Microsoft compiler times -- but given this is a micro-benchmark, do not count too much on the significance of this lone measure. Apparently the GNU compiler does not emit inline x86 string instructions.

From looking at the assembly listing it appears with inlined intrinsics suppressed, the Microsoft compiler emits calls to the standard C library string functions. We might expect the code to do a bit better without the call overhead, so I added inline functions for the tested operations. Also tried the usual combinations of loop unrolling, word versus byte moves, autoincrement versus array offsets (none of which were used in my original string class).

C:\tmp\benchmarks>bin\release\string RATE C intrinsic string calls 428.5 MB/second (1180.0 MB in 2.8 seconds) RATE C function string calls 442.9 MB/second (1180.0 MB in 2.7 seconds) RATE x1 string calls 329.1 MB/second (1180.0 MB in 3.6 seconds) RATE x2 string calls 219.8 MB/second (1180.0 MB in 5.4 seconds) RATE x4 string calls 197.4 MB/second (1180.0 MB in 6.0 seconds) RATE x4x string calls 266.0 MB/second (1180.0 MB in 4.4 seconds) RATE x8 string calls 218.6 MB/second (1180.0 MB in 5.4 seconds) RATE x8a string calls 208.9 MB/second (1180.0 MB in 5.6 seconds) RATE x8j string calls 208.9 MB/second (1180.0 MB in 5.6 seconds) RATE x16j string calls 217.8 MB/second (1180.0 MB in 5.4 seconds) RATE x16a string calls 249.6 MB/second (1180.0 MB in 4.7 seconds) RATE mem string calls 239.5 MB/second (1180.0 MB in 4.9 seconds)

Since the MSVC 6.0 compiler pre-dates the Athlon CPU by a few years, you might wonder how code emitted from the current Microsoft C++ compiler compares. As it turns out, even when the newer compiler is told to optimize specifically for the Athlon, the benchmark results are essentially the same.

C:\tmp\benchmarks>"c:\program files\Microsoft Visual C++ Toolkit 2003\bin\cl.exe" /G7 /O2 string\string.cpp Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86 Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

/out:string.exe string.obj

C:\tmp\benchmarks>string RATE C intrinsic string calls 315.1 MB/second (1180.0 MB in 3.7 seconds) RATE C function string calls 439.6 MB/second (1180.0 MB in 2.7 seconds) RATE x1 string calls 381.3 MB/second (1180.0 MB in 3.1 seconds) RATE x2 string calls 210.8 MB/second (1180.0 MB in 5.6 seconds) RATE x4 string calls 201.7 MB/second (1180.0 MB in 5.8 seconds) RATE x4x string calls 246.5 MB/second (1180.0 MB in 4.8 seconds) RATE x8 string calls 210.8 MB/second (1180.0 MB in 5.6 seconds) RATE x8a string calls 220.2 MB/second (1180.0 MB in 5.4 seconds) RATE x8j string calls 217.4 MB/second (1180.0 MB in 5.4 seconds) RATE x16j string calls 218.6 MB/second (1180.0 MB in 5.4 seconds) RATE x16a string calls 267.2 MB/second (1180.0 MB in 4.4 seconds) RATE mem string calls 244.5 MB/second (1180.0 MB in 4.8 seconds)

When optimized for the Athlon the times change a bit, but the conclusions work out pretty much the same:

Calling the string functions in the C library yields the highest performance.
All the usual optimizations (loop unrolling, word moves) come in slower than a simple one-character-at-a-time while loop.
The "intrinsic" string calls (using inline x86 string instructions) sometimes lose to a simple inlined while loop.

Looks like we can forget about all the old/usual optimization tricks. That leaves the string functions, an inlined one-character while loop, and inlined x86 string instructions as our top three choices.

Note that in this micro-benchmark the inner loop is quite short, and the number of characters moved by each string operation is relatively large. Compared to a more "usual" program (more characters moved / larger inner loop) this minimizes the calling overhead for out-of-line string functions, and gives the compiler less of an opportunity to optimize inline code. My theory is that for a larger inner loop with smaller strings, the out-of-line string functions will lose much of their advantage. In fact, I suspect the simple inlined one-character while loop will end up our top contendor. Something the explore in the next iteration...

#include <stdlib.h> #include <stdio.h> #include <time.h> #include <string.h> //#include

static const char sOut[] = "ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghjklmnopqrstuvwxyz123456789"; int nLength = ::strlen(sOut);

//using namespace std;

void report_times(const char* s,int dt,int cb) { //printf("TIME %s %d ticks for %d characters written\n",s,dt,cb); double ts = (double)dt / CLOCKS_PER_SEC; double mb = (double)cb / 1000000; double rate = mb / ts; printf("RATE %s %0.1f MB/second (%0.1f MB in %0.1f seconds)\n",s,rate,mb,ts); }

// // Perform a function for a fixed number of iterations and return time. //

const int nLoop = 10000000; int dtLoop = 0;

typedef void (doit)(const char,const char*);

int time_function(doit fn) { clock_t t0 = ::clock(); for (int i=0; i<nLoop; ++i) { const char* s1 = sOut + (31 & i); const char* s2 = sOut + nLength - (31 & i); (*fn)(s1,s2); } return (int)(::clock() - t0) - dtLoop; }

// // Time C string operations. //

char sWork[256]; int nTotal = 0;

void do_total(const char* s1,const char* s2) { nTotal += nLength; nTotal += nLength; }

void do_c0_string_calls(const char* s1,const char* s2) { ::strcpy(sWork,s1); ::strcat(sWork,s2); ::strcpy(sWork,s1); ::strcat(sWork,s2); }

// Optimization that seems to work best (at least with an Athlon). #pragma function(strlen,strcpy,strcat)

void do_c1_string_calls(const char* s1,const char* s2) { ::strcpy(sWork,s1); ::strcat(sWork,s2); ::strcpy(sWork,s1); ::strcat(sWork,s2); }

inline void x1_strcpy(char* s1,const char* s2)

{ while (*s1++ = *s2++); }

inline void x1_strcat(char* s1,const char* s2) { while (*s1) ++s1; x1_strcpy(s1,s2); }

void do_x1_string_calls(const char* s1,const char* s2) { x1_strcpy(sWork,s1); x1_strcat(sWork,s2); x1_strcpy(sWork,s1); x1_strcat(sWork,s2); }

inline void x2_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2) >> 1; while (n--) { *s1++ = *s2++; *s1++ = *s2++; } while (s2 < p) { *s1++ = *s2++; } }

inline void x2_strcat(char* s1,const char* s2) { while (*s1) ++s1; x2_strcpy(s1,s2); }

void do_x2_string_calls(const char* s1,const char* s2) { x2_strcpy(sWork,s1); x2_strcat(sWork,s2); x2_strcpy(sWork,s1); x2_strcat(sWork,s2); }

inline void x4_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2) >> 2; while (n--) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } while (s2 < p) { *s1++ = *s2++; } }

inline void x4_strcat(char* s1,const char* s2) { while (*s1) ++s1; x4_strcpy(s1,s2); }

void do_x4_string_calls(const char* s1,const char* s2) { x4_strcpy(sWork,s1); x4_strcat(sWork,s2); x4_strcpy(sWork,s1); x4_strcat(sWork,s2); }

inline void x4x_strcpy(char* s1,const char* s2) { const char* p = s2; while (p++); int n = (int)(p - s2) >> 2; int p1 = (int*)s1; int* p2 = (int*)s2; while (n--) { p1++ = p2++; } s1 = (char)p1; s2 = (const char)p2; while (s2 < p) { *s1++ = *s2++; } }

inline void x4x_strcat(char* s1,const char* s2) { while (*s1) ++s1; x4x_strcpy(s1,s2); }

void do_x4x_string_calls(const char* s1,const char* s2) { x4x_strcpy(sWork,s1); x4x_strcat(sWork,s2); x4x_strcpy(sWork,s1); x4x_strcat(sWork,s2); }

inline void x8_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2) >> 3; while (n--) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } while (s2 < p) { *s1++ = *s2++; } }

inline void x8_strcat(char* s1,const char* s2) { while (*s1) ++s1; x8_strcpy(s1,s2); }

void do_x8_string_calls(const char* s1,const char* s2) { x8_strcpy(sWork,s1); x8_strcat(sWork,s2); x8_strcpy(sWork,s1); x8_strcat(sWork,s2); }

inline void m_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2); ::memcpy(s1,s2,n+1); }

inline void m_strcat(char* s1,const char* s2) { while (*s1) ++s1; m_strcpy(s1,s2); }

void do_m_string_calls(const char* s1,const char* s2) { m_strcpy(sWork,s1); m_strcat(sWork,s2); m_strcpy(sWork,s1); m_strcat(sWork,s2); }

inline void x8a_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2) >> 3; while (n--) { s1[0] = s2[0]; s1[1] = s2[1]; s1[2] = s2[2]; s1[3] = s2[3]; s1[4] = s2[4]; s1[5] = s2[5]; s1[6] = s2[6]; s1[7] = s2[7]; s1 += 8; s2 += 8; } while (s2 < p) { *s1++ = *s2++; } }

inline void x8a_strcat(char* s1,const char* s2) { while (*s1) ++s1; x8a_strcpy(s1,s2); }

void do_x8a_string_calls(const char* s1,const char* s2) { x8a_strcpy(sWork,s1); x8a_strcat(sWork,s2); x8a_strcpy(sWork,s1); x8a_strcat(sWork,s2); }

inline void x8j_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2); goto _0; _8: *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; _0: if (n >> 3) { n -= 8; goto _8; } if (1 & n) { *s1++ = *s2++; } if (2 & n) { *s1++ = *s2++; *s1++ = *s2++; } if (4 & n) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } }

inline void x8j_strcat(char* s1,const char* s2) { while (*s1) ++s1; x8j_strcpy(s1,s2); }

void do_x8j_string_calls(const char* s1,const char* s2) { x8j_strcpy(sWork,s1); x8j_strcat(sWork,s2); x8j_strcpy(sWork,s1); x8j_strcat(sWork,s2); }

inline void x16j_strcpy(char* s1,const char* s2) { const char* p = s2; while (*p++); int n = (int)(p - s2); goto _0; _16: *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; _0: if (n >> 4) { n -= 16; goto _16; } if (1 & n) { *s1++ = *s2++; } if (2 & n) { *s1++ = *s2++; *s1++ = *s2++; } if (4 & n) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } if (8 & n) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } }

inline void x16j_strcat(char* s1,const char* s2) { while (*s1) ++s1; x16j_strcpy(s1,s2); }

void do_x16j_string_calls(const char* s1,const char* s2) { x16j_strcpy(sWork,s1); x16j_strcat(sWork,s2); x16j_strcpy(sWork,s1); x16j_strcat(sWork,s2); }

inline void x16a_strcpy(char* s1,const char* s2) { const char* p = s2; while (p++); int n = (int)(p - s2); goto _0; _16: ((int)s1)[0] = ((int*)s2)[0]; ((int*)s1)[1] = ((int*)s2)[1]; ((int*)s1)[2] = ((int*)s2)[2]; ((int*)s1)[3] = ((int*)s2)[3]; s1 += 16; s2 += 16; _0: if (n >> 4) { n -= 16; goto _16; } if (1 & n) { *s1++ = *s2++; } if (2 & n) { *s1++ = *s2++; *s1++ = *s2++; } if (4 & n) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } if (8 & n) { *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; } }

inline void x16a_strcat(char* s1,const char* s2) { while (*s1) ++s1; x16a_strcpy(s1,s2); }

void do_x16a_string_calls(const char* s1,const char* s2) { x16a_strcpy(sWork,s1); x16a_strcat(sWork,s2); x16a_strcpy(sWork,s1); x16a_strcat(sWork,s2); }

int main(int ac,char** av) { dtLoop = time_function(do_total); report_times("C intrinsic string calls",time_function(do_c0_string_calls),nTotal); report_times("C function string calls",time_function(do_c1_string_calls),nTotal); report_times("x1 string calls",time_function(do_x1_string_calls),nTotal); report_times("x2 string calls",time_function(do_x2_string_calls),nTotal); report_times("x4 string calls",time_function(do_x4_string_calls),nTotal); report_times("x4x string calls",time_function(do_x4x_string_calls),nTotal); report_times("x8 string calls",time_function(do_x8_string_calls),nTotal); report_times("x8a string calls",time_function(do_x8a_string_calls),nTotal); report_times("x8j string calls",time_function(do_x8j_string_calls),nTotal); report_times("x16j string calls",time_function(do_x16j_string_calls),nTotal); report_times("x16a string calls",time_function(do_x16a_string_calls),nTotal); report_times("mem string calls",time_function(do_m_string_calls),nTotal); return 0; }