[Summary] [Continued]

Having checked my assumptions earlier in - Continued - building a better string class - assumptions revised - time to compare C++ string implementations.

The ZString1 class (below) represents the string class used in most of the C++ code I have written in the past decade. Nothing really exotic here - the class is simply a thin wrapper around the standard C string functions (made safe), with the string buffer allocated off a free list.

The ZString2 class simply replaces the use of string functions with inlined C++ code. From the earlier exercise, we gained the expectation that this version might be slightly faster than the calls to standard C string functions.

The micro-benchmark numbers tell pretty much the expected story.

RATE C string calls 262.1 MB/second (1000.0 MB in 3.8 seconds)
RATE ZString1 string calls 151.7 MB/second (1000.0 MB in 6.6 seconds)
RATE ZString2 string calls 158.0 MB/second (1000.0 MB in 6.3 seconds)
RATE C++ string calls 67.4 MB/second (1000.0 MB in 14.8 seconds)

The C string calls are fastest - at least when we are allocating the string buffer off the stack, and do not have to check for buffer overflow. This is traditional (and somewhat unsafe) practice in C programs, but not always practical.

The thin wrapper classes do markedly better than the standard C++ string class. To determine how much of the difference is due to the use of a free list, made the ZString classes always-allocate instead of using a free list, and re-ran the test.

RATE C string calls 266.3 MB/second (1000.0 MB in 3.8 seconds)
RATE ZString1 string calls 84.3 MB/second (1000.0 MB in 11.9 seconds)
RATE ZString2 string calls 88.8 MB/second (1000.0 MB in 11.3 seconds)
RATE C++ string calls 67.4 MB/second (1000.0 MB in 14.8 seconds)

Even without the use of a free list, the ZString classes do better (nearly double the performance) compared to the standard C++ strings. The use of the free list adds a substantial further boost to performance.

There is another set of reasons to consider using a free list for frequently allocated structures. Occasionally you can get patterns of allocation and deallocation that cause heap fragmentation, and this can become a problem in long-running programs. By greatly decreasing the number of heap allocations, you likely improve the stability of your program. This is not just theory - I have seen this more than once in actual customer use.

Adding the pragma so that the compiler generates string function calls instead of inline string intrinsics yields only slightly different numbers.

RATE C string calls 289.4 MB/second (1000.0 MB in 3.5 seconds)
RATE ZString1 string calls 145.8 MB/second (1000.0 MB in 6.9 seconds)
RATE ZString2 string calls 158.0 MB/second (1000.0 MB in 6.3 seconds)
RATE C++ string calls 67.4 MB/second (1000.0 MB in 14.8 seconds)

Tried running the same code through GNU C++ on Linux.

RATE C string calls 304.0 MB/second (1000.0 MB in 3.3 seconds)
RATE ZString1 string calls 170.1 MB/second (1000.0 MB in 5.9 seconds)
RATE ZString2 string calls 187.6 MB/second (1000.0 MB in 5.3 seconds)
RATE C++ string calls 81.7 MB/second (1000.0 MB in 12.2 seconds)

The relative results are pretty much in line with the Windows results (note the Linux box has a slightly faster CPU).

Tried the same code with Cygwin GNU C++ under Windows.

RATE C string calls 240.0 MB/second (1000.0 MB in 4.2 seconds)
RATE ZString1 string calls 72.0 MB/second (1000.0 MB in 13.9 seconds)
RATE ZString2 string calls 61.0 MB/second (1000.0 MB in 16.4 seconds)
RATE C++ string calls 4.9 MB/second (1000.0 MB in 202.9 seconds)

Yikes! Something is radically less efficient! The relative performance story is still the same.

Note that I am not claiming this micro-benchmark tells the performance story for every possible application using strings. Rather the intent here is to give a starting point. Also there is no intent here to handle anything other than single-byte character strings.

Note also that I am not trying to reproduce here every function offered by the standard C++ string classes. Rather I start with something like the ZString class as a base and add functions as-needed for the individual application. (I have a preference for thin/minimalistic classes).

The conclusion that I draw from all this is that my original string class does indeed still greatly out-perform the standard C++ string class (at least for the sort of usage I see most often). The string class can be slightly improved by disabling the string intrinsics, or by instead using equivalent inlined C++ code. In fact this micro-benchmark likely understates the performance advantage, if your application performs relatively more string allocations and relatively fewer string copy operations.

ZString.h

#ifndef __ZSTRING_H__
#define __ZSTRING_H__

//
//  Simple string class.
//

class ZString
{

protected:
    char* sBuffer;
    int cbBufferMax;

public:
    ZString() { upsizeTo(1); }
    ~ZString() { recycle(); }

protected:
    void recycle();
    void upsizeTo(int);
    void sizeTo(int n) {
        if (cbBufferMax <= n) {
            upsizeTo(n);
        }
    }

public:
    operator const char*()          { return sBuffer; }
    char* getBuffer()               { return sBuffer; }
    char* getBuffer(int n)          { sizeTo(n); return sBuffer; }
    int getBufferSize()             { return cbBufferMax; }

};

#endif

ZString1.h

#ifndef __ZSTRING1_H__
#define __ZSTRING1_H__

#include "ZString.h"

//
//  Simple string class.
//

class ZString1 : public ZString
{

public:
    void operator=(const char* s) {
        strcpy(s);
    }

public:
    static int strlen(const char* s) {
        return ::strlen(s);
    }
    int strlen() {
        return strlen(sBuffer);
    }
    void strcpy(const char* s) {
        sizeTo(strlen(s));
        ::strcpy(sBuffer,s);
    }
    void strcpy(const char* s,int n) {
        sizeTo(n);
        ::strncpy(sBuffer,s,n);
        sBuffer[n] = 0;
    }
    void strcat(const char* s) {
        int n1 = strlen();
        int n2 = strlen(s);
        sizeTo(n1 + n2);
        ::strcpy(sBuffer+n1,s);
    }
    void strcat(const char* s,int n) {
        int n1 = strlen();
        sizeTo(n1 + n);
        ::strncpy(sBuffer+n1,s,n);
        sBuffer[n1+n] = 0;
    }
    void strlwr() {
        ::strlwr(sBuffer);
    }
    void strupr() {
        ::strupr(sBuffer);
    }

};

#endif

ZString2.h

#ifndef __ZSTRING2_H__
#define __ZSTRING2_H__

#include "ZString.h"

//
//  Simple string class.
//

class ZString2 : public ZString
{

public:
    void operator=(const char* s) {
        strcpy(s);
    }

public:
    static int strlen(const char* s) {
        int n = 0;
        while (*s++) ++n;
        return n;
    }
    int strlen() {
        return strlen(sBuffer);
    }
    void strcpy(const char* s) {
        sizeTo(strlen(s));
        char* p = sBuffer;
        while (*p++ = *s++);
    }
    void strcpy(const char* s,int n) {
        int n2 = strlen(s);
        if (n2 < n) n = n2;
        sizeTo(n);
        char* p = sBuffer;
        int i = 0;
        while (i < n) { p[i] = s[i]; ++i; }
        p[i] = 0;
    }
    void strcat(const char* s) {
        int n1 = strlen();
        int n2 = strlen(s);
        sizeTo(n1 + n2);
        char* p = sBuffer + n1;
        while (*p++ = *s++);
    }
    void strcat(const char* s,int n) {
        int n1 = strlen();
        int n2 = strlen(s);
        if (n < n2) n2 = n;
        sizeTo(n1 + n2);
        char* p = sBuffer + n1;
        int i = 0;
        while (i < n2) { p[i] = s[i]; ++i; }
    }
    void strlwr() {
        ::strlwr(sBuffer);
    }
    void strupr() {
        ::strupr(sBuffer);
    }

};

#endif

ZString.cpp

#include <stdlib.h>
#include <string.h>
#include "ZString.h"

//
//  Out-of-line string methods and free list maintenance.
//

enum { STRING_BUFFER_SIZE = 256 };
static void* g_pFreeList;

void ZString::recycle()
{
    if (STRING_BUFFER_SIZE != cbBufferMax) {
        delete sBuffer;
        sBuffer = 0;
        return;
    }
    *((void**)sBuffer) = g_pFreeList;
    g_pFreeList = sBuffer;
    sBuffer = 0;
    cbBufferMax = 0;
}

void ZString::upsizeTo(int n)
{
    // Allocate oversize strings.
    if (STRING_BUFFER_SIZE <= n) {
        // round up to a quanta
        n = ((n + STRING_BUFFER_SIZE + 1) / STRING_BUFFER_SIZE) * STRING_BUFFER_SIZE;
        char* p = new char[n];
        ::strcpy(p,sBuffer);
        recycle();
        sBuffer = p;
        cbBufferMax = n;
        return;
    }
    cbBufferMax = STRING_BUFFER_SIZE;
    // Grab a buffer from the free list if present (the usual case).
    if (g_pFreeList) {
        sBuffer = (char*) g_pFreeList;
        g_pFreeList = *((void**)g_pFreeList);
    } else {
        // Allocate a stock-sized buffer.
        sBuffer = new char[STRING_BUFFER_SIZE];
    }
    *sBuffer = 0;
}

string.cpp

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h>

// Optimization that seems to work best (at least with an Athlon).
//#pragma function(strlen,strcpy,strcat)

#include "ZString1.h"
#include "ZString2.h"
#include <string>

// test strings
static const char sOut1[] = "abcdefghjklmnopqrstuvwxyz";
static const char sOut2[] = "ABCDEFGHJKLMNOPQRSTUVWXYZ";
int nLength = ::strlen(sOut1);

void report_times(const char* s,int dt,int cb)
{
    //printf("TIME %s %d ticks for %d characters written\n",s,dt,cb);
    double ts = (double)dt / CLOCKS_PER_SEC;
    double mb = (double)cb / 1000000;
    double rate = mb / ts;
    printf("RATE %s %0.1f MB/second (%0.1f MB in %0.1f seconds)\n",s,rate,mb,ts);
}

//
//  Perform a function for a fixed number of iterations and return time.
//

const int nLoop = 10000000;
int dtLoop = 0;

typedef void (*doit)(const char*,const char*);

int time_function(doit fn)
{
    clock_t t0 = ::clock();
    for (int i=0; i<nLoop; ++i) {
        const char* s1 = sOut1 + (15 & i);
        const char* s2 = sOut2 + nLength - (15 & i);
        (*fn)(s1,s2);
    }
    return (int)(::clock() - t0) - dtLoop;
}

//
//  Time C string operations.
//

int nTotal = 0;

void do_total(const char* s1,const char* s2)
{
    nTotal += 4 * nLength;
}

void do_c_string_calls(const char* s1,const char* s2)
{
    char sWork[256];
    ::strcpy(sWork,s1);
    ::strcat(sWork,s2);
    ::strcpy(sWork,s2);
    ::strcat(sWork,s1);
    ::strcpy(sWork,s1);
    ::strcat(sWork,s1);
    ::strcpy(sWork,s2);
    ::strcat(sWork,s2);
}

void do_z1_string_calls(const char* s1,const char* s2)
{
    ZString1 sWork;
    sWork.strcpy(s1);
    sWork.strcat(s2);
    sWork.strcpy(s2);
    sWork.strcat(s1);
    sWork.strcpy(s1);
    sWork.strcat(s1);
    sWork.strcpy(s2);
    sWork.strcat(s2);
}

void do_z2_string_calls(const char* s1,const char* s2)
{
    ZString2 sWork;
    sWork.strcpy(s1);
    sWork.strcat(s2);
    sWork.strcpy(s2);
    sWork.strcat(s1);
    sWork.strcpy(s1);
    sWork.strcat(s1);
    sWork.strcpy(s2);
    sWork.strcat(s2);
}

void do_cpp_string_calls(const char* s1,const char* s2)
{
    std::string sWork;
    sWork.assign(s1);
    sWork.append(s2);
    sWork.assign(s2);
    sWork.append(s1);
    sWork.assign(s1);
    sWork.append(s1);
    sWork.assign(s2);
    sWork.append(s2);
}


int main(int ac,char** av)
{
    dtLoop = time_function(do_total);
    report_times("C string calls",time_function(do_c_string_calls),nTotal);
    report_times("ZString1 string calls",time_function(do_z1_string_calls),nTotal);
    report_times("ZString2 string calls",time_function(do_z2_string_calls),nTotal);
    report_times("C++ string calls",time_function(do_cpp_string_calls),nTotal);
    return 0;
}