
UTF8/UCS conversion benchmark

Point of reference...

UCS (Unicode) to UTF8 conversion, and the reverse, when efficiently coded in C++, clocks in well above 100 MB/s on current-generation CPUs. If you are getting much less than that, enough to be a problem, then there are questions you should ask. The following run spans 1 to 6 byte UTF8 encodings.

Recoding a UTF8 string to a normalized form is a bit slower, falling from 122 MB/s at the lowest base to roughly 84 to 90 MB/s at the higher ones.

    preston@athena:~/workspace/json-c$ time Release/json
    base 00000010 :  229.9 MB/s UCS to UTF8 conversion
    base 00000010 :  253.7 MB/s UTF8 to UCS conversion
    base 00000010 :  122.0 MB/s UTF8 recode
    base 02000010 :  196.8 MB/s UCS to UTF8 conversion
    base 02000010 :  177.1 MB/s UTF8 to UCS conversion
    base 02000010 :  100.6 MB/s UTF8 recode
    base 04000010 :  173.8 MB/s UCS to UTF8 conversion
    base 04000010 :  157.1 MB/s UTF8 to UCS conversion
    base 04000010 :   89.2 MB/s UTF8 recode
    base 06000010 :  174.0 MB/s UCS to UTF8 conversion
    base 06000010 :  158.5 MB/s UTF8 to UCS conversion
    base 06000010 :   89.2 MB/s UTF8 recode
    base 08000010 :  174.0 MB/s UCS to UTF8 conversion
    base 08000010 :  154.5 MB/s UTF8 to UCS conversion
    base 08000010 :   89.6 MB/s UTF8 recode
    base 0a000010 :  174.1 MB/s UCS to UTF8 conversion
    base 0a000010 :  156.5 MB/s UTF8 to UCS conversion
    base 0a000010 :   89.8 MB/s UTF8 recode
    base 0c000010 :  174.0 MB/s UCS to UTF8 conversion
    base 0c000010 :  155.1 MB/s UTF8 to UCS conversion
    base 0c000010 :   89.6 MB/s UTF8 recode
    base 0e000010 :  174.0 MB/s UCS to UTF8 conversion
    base 0e000010 :  158.8 MB/s UTF8 to UCS conversion
    base 0e000010 :   87.4 MB/s UTF8 recode
    base 10000010 :  170.8 MB/s UCS to UTF8 conversion
    base 10000010 :  158.2 MB/s UTF8 to UCS conversion
    base 10000010 :   89.7 MB/s UTF8 recode
    base 12000010 :  174.0 MB/s UCS to UTF8 conversion
    base 12000010 :  158.8 MB/s UTF8 to UCS conversion
    base 12000010 :   86.5 MB/s UTF8 recode
    base 14000010 :  171.5 MB/s UCS to UTF8 conversion
    base 14000010 :  153.9 MB/s UTF8 to UCS conversion
    base 14000010 :   87.1 MB/s UTF8 recode
    base 16000010 :  172.1 MB/s UCS to UTF8 conversion
    base 16000010 :  158.1 MB/s UTF8 to UCS conversion
    base 16000010 :   87.5 MB/s UTF8 recode
    base 18000010 :  172.1 MB/s UCS to UTF8 conversion
    base 18000010 :  158.2 MB/s UTF8 to UCS conversion
    base 18000010 :   86.9 MB/s UTF8 recode
    base 1a000010 :  171.3 MB/s UCS to UTF8 conversion
    base 1a000010 :  158.2 MB/s UTF8 to UCS conversion
    base 1a000010 :   86.5 MB/s UTF8 recode
    base 1c000010 :  169.5 MB/s UCS to UTF8 conversion
    base 1c000010 :  158.5 MB/s UTF8 to UCS conversion
    base 1c000010 :   86.5 MB/s UTF8 recode
    base 1e000010 :  173.0 MB/s UCS to UTF8 conversion
    base 1e000010 :  157.8 MB/s UTF8 to UCS conversion
    base 1e000010 :   86.1 MB/s UTF8 recode
    base 20000010 :  173.1 MB/s UCS to UTF8 conversion
    base 20000010 :  158.4 MB/s UTF8 to UCS conversion
    base 20000010 :   87.2 MB/s UTF8 recode
    base 22000010 :  173.1 MB/s UCS to UTF8 conversion
    base 22000010 :  158.2 MB/s UTF8 to UCS conversion
    base 22000010 :   88.2 MB/s UTF8 recode
    base 24000010 :  173.3 MB/s UCS to UTF8 conversion
    base 24000010 :  158.8 MB/s UTF8 to UCS conversion
    base 24000010 :   87.6 MB/s UTF8 recode
    base 26000010 :  173.8 MB/s UCS to UTF8 conversion
    base 26000010 :  158.1 MB/s UTF8 to UCS conversion
    base 26000010 :   87.9 MB/s UTF8 recode
    base 28000010 :  172.0 MB/s UCS to UTF8 conversion
    base 28000010 :  158.7 MB/s UTF8 to UCS conversion
    base 28000010 :   87.6 MB/s UTF8 recode
    base 2a000010 :  172.0 MB/s UCS to UTF8 conversion
    base 2a000010 :  158.1 MB/s UTF8 to UCS conversion
    base 2a000010 :   86.1 MB/s UTF8 recode
    base 2c000010 :  172.1 MB/s UCS to UTF8 conversion
    base 2c000010 :  158.5 MB/s UTF8 to UCS conversion
    base 2c000010 :   86.3 MB/s UTF8 recode
    base 2e000010 :  173.8 MB/s UCS to UTF8 conversion
    base 2e000010 :  158.2 MB/s UTF8 to UCS conversion
    base 2e000010 :   89.3 MB/s UTF8 recode
    base 30000010 :  170.0 MB/s UCS to UTF8 conversion
    base 30000010 :  158.7 MB/s UTF8 to UCS conversion
    base 30000010 :   87.2 MB/s UTF8 recode
    base 32000010 :  173.8 MB/s UCS to UTF8 conversion
    base 32000010 :  158.4 MB/s UTF8 to UCS conversion
    base 32000010 :   87.8 MB/s UTF8 recode
    base 34000010 :  173.5 MB/s UCS to UTF8 conversion
    base 34000010 :  158.5 MB/s UTF8 to UCS conversion
    base 34000010 :   88.0 MB/s UTF8 recode
    base 36000010 :  173.1 MB/s UCS to UTF8 conversion
    base 36000010 :  158.4 MB/s UTF8 to UCS conversion
    base 36000010 :   88.4 MB/s UTF8 recode
    base 38000010 :  172.5 MB/s UCS to UTF8 conversion
    base 38000010 :  158.5 MB/s UTF8 to UCS conversion
    base 38000010 :   88.7 MB/s UTF8 recode
    base 3a000010 :  170.5 MB/s UCS to UTF8 conversion
    base 3a000010 :  158.4 MB/s UTF8 to UCS conversion
    base 3a000010 :   87.4 MB/s UTF8 recode
    base 3c000010 :  169.3 MB/s UCS to UTF8 conversion
    base 3c000010 :  154.7 MB/s UTF8 to UCS conversion
    base 3c000010 :   86.3 MB/s UTF8 recode
    base 3e000010 :  172.0 MB/s UCS to UTF8 conversion
    base 3e000010 :  156.9 MB/s UTF8 to UCS conversion
    base 3e000010 :   87.9 MB/s UTF8 recode
    base 40000010 :  171.8 MB/s UCS to UTF8 conversion
    base 40000010 :  148.9 MB/s UTF8 to UCS conversion
    base 40000010 :   83.7 MB/s UTF8 recode
    base 42000010 :  173.1 MB/s UCS to UTF8 conversion
    base 42000010 :  149.3 MB/s UTF8 to UCS conversion
    base 42000010 :   88.1 MB/s UTF8 recode
    base 44000010 :  173.7 MB/s UCS to UTF8 conversion
    base 44000010 :  157.7 MB/s UTF8 to UCS conversion
    base 44000010 :   87.4 MB/s UTF8 recode
    base 46000010 :  172.1 MB/s UCS to UTF8 conversion
    base 46000010 :  152.2 MB/s UTF8 to UCS conversion
    base 46000010 :   87.3 MB/s UTF8 recode
    base 48000010 :  174.0 MB/s UCS to UTF8 conversion
    base 48000010 :  142.7 MB/s UTF8 to UCS conversion
    base 48000010 :   87.8 MB/s UTF8 recode
    base 4a000010 :  173.5 MB/s UCS to UTF8 conversion
    base 4a000010 :  150.2 MB/s UTF8 to UCS conversion
    base 4a000010 :   87.7 MB/s UTF8 recode
    base 4c000010 :  174.0 MB/s UCS to UTF8 conversion
    base 4c000010 :  158.4 MB/s UTF8 to UCS conversion
    base 4c000010 :   88.4 MB/s UTF8 recode
    base 4e000010 :  174.0 MB/s UCS to UTF8 conversion
    base 4e000010 :  156.9 MB/s UTF8 to UCS conversion
    base 4e000010 :   88.4 MB/s UTF8 recode

    real    2m0.751s
    user    2m0.540s
    sys     0m0.190s

The above run is on an AMD Phenom(tm) II X4 955 processor running at its stock clock rate.
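For anyone wanting a comparable number from their own converter, here is a minimal sketch of how an MB/s figure like the ones above could be computed. It is not the harness used for this run: the name `measure_mb_per_s`, the pass count, and the choice of 1 MB = 1,000,000 bytes are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the json-c code): call `work` `passes` times,
// where each call is assumed to process `bytes` bytes, and return the
// throughput in MB/s. Assumes 1 MB = 1,000,000 bytes.
template <typename Work>
double measure_mb_per_s(std::size_t bytes, int passes, Work work) {
    assert(passes > 0);
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    const auto start = clock::now();
    for (int i = 0; i < passes; ++i) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    // Total bytes processed divided by elapsed seconds, scaled to megabytes.
    return (static_cast<double>(bytes) * passes) / (elapsed.count() * 1e6);
}
```

Pick a pass count large enough that each measurement runs for a noticeable fraction of a second; very short runs are dominated by timer resolution and cache warm-up.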

Code is in: http://svn.bannister.us/public/json-c/ (A start on an experiment in fastest-possible JSON conversion.)

Note that I intentionally allow proper conversion of "invalid" UTF8 code strings. I completely understand the reason for the disallowed conversions, and I disagree.
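As a rough illustration of what allowing "invalid" sequences can look like, here is a minimal sketch of a permissive decoder. It is not the code from the repository. The name `utf8_decode_one`, the handling of stray continuation bytes, and the handling of truncated input are assumptions. It accepts overlong forms, surrogates, and 5 and 6 byte sequences without complaint.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decode one code point starting at p and advance p.
// Permissive on purpose: overlong encodings, surrogate values, and values
// above 0x10FFFF are passed through rather than rejected.
inline std::uint32_t utf8_decode_one(const unsigned char*& p,
                                     const unsigned char* end) {
    assert(p < end);
    const unsigned char lead = *p++;
    std::uint32_t c;
    int extra;  // number of continuation bytes expected
    if (lead < 0xC0) {
        // ASCII, or a stray continuation byte returned as-is (an assumption).
        return lead;
    } else if (lead < 0xE0) { c = lead & 0x1Fu; extra = 1; }
    else if (lead < 0xF0)   { c = lead & 0x0Fu; extra = 2; }
    else if (lead < 0xF8)   { c = lead & 0x07u; extra = 3; }
    else if (lead < 0xFC)   { c = lead & 0x03u; extra = 4; }
    else                    { c = lead & 0x01u; extra = 5; }
    // Fold in six payload bits per continuation byte; stop early if the
    // input is truncated instead of reading past the end.
    while (extra-- > 0 && p != end) {
        c = (c << 6) | (*p++ & 0x3Fu);
    }
    return c;
}
```

With this sketch the overlong pair `C0 80` decodes to 0 and `ED A0 80` decodes to the surrogate value 0xD800, both of which a strict decoder would reject.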

Update: Converted to use pointer arithmetic, rather than array and index. Was not sure pointer math was still a win on current CPUs and compilers. Got a big boost in throughput, so it is!
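To show the pointer-arithmetic style the update refers to, here is a minimal sketch of a UCS to UTF8 encoder covering the 1 to 6 byte forms. It is an illustration, not the repository code: the names `ucs_to_utf8` and `cont` and the calling convention are assumptions. The caller is assumed to provide an output buffer with room for 6 bytes per input code point.

```cpp
#include <cassert>
#include <cstdint>

// Continuation byte: 10xxxxxx carrying six bits of c starting at bit `shift`.
inline unsigned char cont(std::uint32_t c, int shift) {
    return static_cast<unsigned char>(0x80u | ((c >> shift) & 0x3Fu));
}

// Hypothetical encoder: walks src and dst with raw pointers instead of
// array indexing. Returns a pointer one past the last byte written.
// No validity checks; values above 0x7FFFFFFF lose their top bit.
inline unsigned char* ucs_to_utf8(const std::uint32_t* src,
                                  const std::uint32_t* src_end,
                                  unsigned char* dst) {
    assert(src <= src_end && dst != nullptr);
    while (src != src_end) {
        const std::uint32_t c = *src++;
        if (c < 0x80u) {                 // 1 byte: 0xxxxxxx
            *dst++ = static_cast<unsigned char>(c);
        } else if (c < 0x800u) {         // 2 bytes
            *dst++ = static_cast<unsigned char>(0xC0u | (c >> 6));
            *dst++ = cont(c, 0);
        } else if (c < 0x10000u) {       // 3 bytes
            *dst++ = static_cast<unsigned char>(0xE0u | (c >> 12));
            *dst++ = cont(c, 6);
            *dst++ = cont(c, 0);
        } else if (c < 0x200000u) {      // 4 bytes
            *dst++ = static_cast<unsigned char>(0xF0u | (c >> 18));
            *dst++ = cont(c, 12);
            *dst++ = cont(c, 6);
            *dst++ = cont(c, 0);
        } else if (c < 0x4000000u) {     // 5 bytes
            *dst++ = static_cast<unsigned char>(0xF8u | (c >> 24));
            *dst++ = cont(c, 18);
            *dst++ = cont(c, 12);
            *dst++ = cont(c, 6);
            *dst++ = cont(c, 0);
        } else {                         // 6 bytes
            *dst++ = static_cast<unsigned char>(0xFCu | ((c >> 30) & 0x01u));
            *dst++ = cont(c, 24);
            *dst++ = cont(c, 18);
            *dst++ = cont(c, 12);
            *dst++ = cont(c, 6);
            *dst++ = cont(c, 0);
        }
    }
    return dst;
}
```

Whether this beats an indexed loop depends on the compiler and optimization level, so it is worth measuring both forms on your own setup before assuming either is faster.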