UTF8/UCS conversion benchmark
Point of reference...
UCS (Unicode) to UTF8 conversion, and the reverse, when efficiently coded in C++, clocks in well above 100 MB/s on current-generation CPUs. If you are getting much less than that, enough to be a problem, then there are questions you should ask. The following run spans 1 to 6 byte UTF8 encodings.
Recoding a UTF8 string to a normalized form is slower, at roughly 84 to 122 MB/s in the run below.
preston@athena:~/workspace/json-c$ time Release/json
base 00000010 : 229.9 MB/s UCS to UTF8 conversion
base 00000010 : 253.7 MB/s UTF8 to UCS conversion
base 00000010 : 122.0 MB/s UTF8 recode
base 02000010 : 196.8 MB/s UCS to UTF8 conversion
base 02000010 : 177.1 MB/s UTF8 to UCS conversion
base 02000010 : 100.6 MB/s UTF8 recode
base 04000010 : 173.8 MB/s UCS to UTF8 conversion
base 04000010 : 157.1 MB/s UTF8 to UCS conversion
base 04000010 : 89.2 MB/s UTF8 recode
base 06000010 : 174.0 MB/s UCS to UTF8 conversion
base 06000010 : 158.5 MB/s UTF8 to UCS conversion
base 06000010 : 89.2 MB/s UTF8 recode
base 08000010 : 174.0 MB/s UCS to UTF8 conversion
base 08000010 : 154.5 MB/s UTF8 to UCS conversion
base 08000010 : 89.6 MB/s UTF8 recode
base 0a000010 : 174.1 MB/s UCS to UTF8 conversion
base 0a000010 : 156.5 MB/s UTF8 to UCS conversion
base 0a000010 : 89.8 MB/s UTF8 recode
base 0c000010 : 174.0 MB/s UCS to UTF8 conversion
base 0c000010 : 155.1 MB/s UTF8 to UCS conversion
base 0c000010 : 89.6 MB/s UTF8 recode
base 0e000010 : 174.0 MB/s UCS to UTF8 conversion
base 0e000010 : 158.8 MB/s UTF8 to UCS conversion
base 0e000010 : 87.4 MB/s UTF8 recode
base 10000010 : 170.8 MB/s UCS to UTF8 conversion
base 10000010 : 158.2 MB/s UTF8 to UCS conversion
base 10000010 : 89.7 MB/s UTF8 recode
base 12000010 : 174.0 MB/s UCS to UTF8 conversion
base 12000010 : 158.8 MB/s UTF8 to UCS conversion
base 12000010 : 86.5 MB/s UTF8 recode
base 14000010 : 171.5 MB/s UCS to UTF8 conversion
base 14000010 : 153.9 MB/s UTF8 to UCS conversion
base 14000010 : 87.1 MB/s UTF8 recode
base 16000010 : 172.1 MB/s UCS to UTF8 conversion
base 16000010 : 158.1 MB/s UTF8 to UCS conversion
base 16000010 : 87.5 MB/s UTF8 recode
base 18000010 : 172.1 MB/s UCS to UTF8 conversion
base 18000010 : 158.2 MB/s UTF8 to UCS conversion
base 18000010 : 86.9 MB/s UTF8 recode
base 1a000010 : 171.3 MB/s UCS to UTF8 conversion
base 1a000010 : 158.2 MB/s UTF8 to UCS conversion
base 1a000010 : 86.5 MB/s UTF8 recode
base 1c000010 : 169.5 MB/s UCS to UTF8 conversion
base 1c000010 : 158.5 MB/s UTF8 to UCS conversion
base 1c000010 : 86.5 MB/s UTF8 recode
base 1e000010 : 173.0 MB/s UCS to UTF8 conversion
base 1e000010 : 157.8 MB/s UTF8 to UCS conversion
base 1e000010 : 86.1 MB/s UTF8 recode
base 20000010 : 173.1 MB/s UCS to UTF8 conversion
base 20000010 : 158.4 MB/s UTF8 to UCS conversion
base 20000010 : 87.2 MB/s UTF8 recode
base 22000010 : 173.1 MB/s UCS to UTF8 conversion
base 22000010 : 158.2 MB/s UTF8 to UCS conversion
base 22000010 : 88.2 MB/s UTF8 recode
base 24000010 : 173.3 MB/s UCS to UTF8 conversion
base 24000010 : 158.8 MB/s UTF8 to UCS conversion
base 24000010 : 87.6 MB/s UTF8 recode
base 26000010 : 173.8 MB/s UCS to UTF8 conversion
base 26000010 : 158.1 MB/s UTF8 to UCS conversion
base 26000010 : 87.9 MB/s UTF8 recode
base 28000010 : 172.0 MB/s UCS to UTF8 conversion
base 28000010 : 158.7 MB/s UTF8 to UCS conversion
base 28000010 : 87.6 MB/s UTF8 recode
base 2a000010 : 172.0 MB/s UCS to UTF8 conversion
base 2a000010 : 158.1 MB/s UTF8 to UCS conversion
base 2a000010 : 86.1 MB/s UTF8 recode
base 2c000010 : 172.1 MB/s UCS to UTF8 conversion
base 2c000010 : 158.5 MB/s UTF8 to UCS conversion
base 2c000010 : 86.3 MB/s UTF8 recode
base 2e000010 : 173.8 MB/s UCS to UTF8 conversion
base 2e000010 : 158.2 MB/s UTF8 to UCS conversion
base 2e000010 : 89.3 MB/s UTF8 recode
base 30000010 : 170.0 MB/s UCS to UTF8 conversion
base 30000010 : 158.7 MB/s UTF8 to UCS conversion
base 30000010 : 87.2 MB/s UTF8 recode
base 32000010 : 173.8 MB/s UCS to UTF8 conversion
base 32000010 : 158.4 MB/s UTF8 to UCS conversion
base 32000010 : 87.8 MB/s UTF8 recode
base 34000010 : 173.5 MB/s UCS to UTF8 conversion
base 34000010 : 158.5 MB/s UTF8 to UCS conversion
base 34000010 : 88.0 MB/s UTF8 recode
base 36000010 : 173.1 MB/s UCS to UTF8 conversion
base 36000010 : 158.4 MB/s UTF8 to UCS conversion
base 36000010 : 88.4 MB/s UTF8 recode
base 38000010 : 172.5 MB/s UCS to UTF8 conversion
base 38000010 : 158.5 MB/s UTF8 to UCS conversion
base 38000010 : 88.7 MB/s UTF8 recode
base 3a000010 : 170.5 MB/s UCS to UTF8 conversion
base 3a000010 : 158.4 MB/s UTF8 to UCS conversion
base 3a000010 : 87.4 MB/s UTF8 recode
base 3c000010 : 169.3 MB/s UCS to UTF8 conversion
base 3c000010 : 154.7 MB/s UTF8 to UCS conversion
base 3c000010 : 86.3 MB/s UTF8 recode
base 3e000010 : 172.0 MB/s UCS to UTF8 conversion
base 3e000010 : 156.9 MB/s UTF8 to UCS conversion
base 3e000010 : 87.9 MB/s UTF8 recode
base 40000010 : 171.8 MB/s UCS to UTF8 conversion
base 40000010 : 148.9 MB/s UTF8 to UCS conversion
base 40000010 : 83.7 MB/s UTF8 recode
base 42000010 : 173.1 MB/s UCS to UTF8 conversion
base 42000010 : 149.3 MB/s UTF8 to UCS conversion
base 42000010 : 88.1 MB/s UTF8 recode
base 44000010 : 173.7 MB/s UCS to UTF8 conversion
base 44000010 : 157.7 MB/s UTF8 to UCS conversion
base 44000010 : 87.4 MB/s UTF8 recode
base 46000010 : 172.1 MB/s UCS to UTF8 conversion
base 46000010 : 152.2 MB/s UTF8 to UCS conversion
base 46000010 : 87.3 MB/s UTF8 recode
base 48000010 : 174.0 MB/s UCS to UTF8 conversion
base 48000010 : 142.7 MB/s UTF8 to UCS conversion
base 48000010 : 87.8 MB/s UTF8 recode
base 4a000010 : 173.5 MB/s UCS to UTF8 conversion
base 4a000010 : 150.2 MB/s UTF8 to UCS conversion
base 4a000010 : 87.7 MB/s UTF8 recode
base 4c000010 : 174.0 MB/s UCS to UTF8 conversion
base 4c000010 : 158.4 MB/s UTF8 to UCS conversion
base 4c000010 : 88.4 MB/s UTF8 recode
base 4e000010 : 174.0 MB/s UCS to UTF8 conversion
base 4e000010 : 156.9 MB/s UTF8 to UCS conversion
base 4e000010 : 88.4 MB/s UTF8 recode
real 2m0.751s
user 2m0.540s
sys 0m0.190s
The above run is on an AMD Phenom(tm) II X4 955 processor running at the stock clock rate.
Code is in: http://svn.bannister.us/public/json-c/ (A start on an experiment in fastest-possible JSON conversion.)
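For readers who do not want to check out the repository, here is a minimal sketch of what a UCS to UTF8 encoder covering the 1 to 6 byte forms can look like. This is not the code from the repository: the function name ucs_to_utf8 and its signature are made up for illustration, and it assumes the original 31-bit UTF8 scheme (values up to 0x7FFFFFFF) and a caller-supplied buffer with room for 6 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from json-c: encode one code point with the
// original 1-to-6-byte UTF8 scheme. Returns a pointer just past the last byte
// written. The caller must supply at least 6 bytes of space at `out`.
inline unsigned char* ucs_to_utf8(std::uint32_t c, unsigned char* out) {
    assert(out != nullptr);
    assert(c <= 0x7FFFFFFFu);  // 31 bits is the most 6 bytes can carry

    if (c < 0x80) {            // 1 byte: plain ASCII
        *out++ = static_cast<unsigned char>(c);
        return out;
    }

    int extra;      // number of continuation bytes
    unsigned lead;  // marker bits for the lead byte
    if (c < 0x800)          { extra = 1; lead = 0xC0; }
    else if (c < 0x10000)   { extra = 2; lead = 0xE0; }
    else if (c < 0x200000)  { extra = 3; lead = 0xF0; }
    else if (c < 0x4000000) { extra = 4; lead = 0xF8; }
    else                    { extra = 5; lead = 0xFC; }

    // Lead byte carries the top bits, each continuation byte carries 6 more.
    *out++ = static_cast<unsigned char>(lead | (c >> (6 * extra)));
    while (extra-- > 0) {
        *out++ = static_cast<unsigned char>(0x80 | ((c >> (6 * extra)) & 0x3F));
    }
    return out;
}
```

The single length decision up front, followed by a short shift-and-mask loop, keeps the hot path small. Whether this particular shape gets anywhere near the numbers above is something to measure, not assume.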
Note that I intentionally allow proper conversion of "invalid" UTF8 code strings. I completely understand the reason for the disallowed conversions, and I disagree.
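To make that concrete, here is a sketch of a permissive decoder. It rests on my reading that "invalid" here means sequences a strict decoder rejects even though they are structurally well formed: overlong forms, encoded surrogates, and values past 0x10FFFF. That reading is an assumption, as are the name utf8_to_ucs, the signature, and the choice to return U+FFFD for structurally broken input. It is not the repository's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical permissive decoder, not taken from json-c. Decodes one
// sequence starting at `in` (which must be before `end`), stores the value
// in `c`, and returns a pointer just past the bytes consumed.
// Overlong forms, surrogates and 5/6-byte forms are decoded as-is.
// Only structural damage (stray or missing continuation bytes, 0xFE/0xFF)
// yields U+FFFD; that policy is an assumption made for this sketch.
inline const unsigned char* utf8_to_ucs(const unsigned char* in,
                                        const unsigned char* end,
                                        std::uint32_t& c) {
    assert(in < end);
    unsigned b = *in++;
    int extra;  // continuation bytes still expected

    if (b < 0x80)      { c = b; return in; }
    else if (b < 0xC0) { c = 0xFFFD; return in; }  // stray continuation byte
    else if (b < 0xE0) { extra = 1; c = b & 0x1F; }
    else if (b < 0xF0) { extra = 2; c = b & 0x0F; }
    else if (b < 0xF8) { extra = 3; c = b & 0x07; }
    else if (b < 0xFC) { extra = 4; c = b & 0x03; }
    else if (b < 0xFE) { extra = 5; c = b & 0x01; }
    else               { c = 0xFFFD; return in; }  // 0xFE and 0xFF never lead

    while (extra-- > 0) {
        if (in == end || (*in & 0xC0) != 0x80) {   // truncated sequence
            c = 0xFFFD;
            return in;
        }
        c = (c << 6) | (*in++ & 0x3Fu);
    }
    return in;  // note: no range or overlong checks, by design
}
```

A strict decoder would add range checks after the loop. Leaving them out is exactly the permissive behavior described above, and it also removes a few branches from the inner loop.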
Update: Converted to use pointer arithmetic, rather than array and index. Was not sure pointer math was still a win on current CPUs and compilers. Got a big boost in throughput, so it is!
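For anyone unsure what the two styles look like side by side, here is a small sketch. It is a toy loop that widens bytes to 32-bit values, not the converter from the repository, and the function names are made up for illustration. Both versions produce the same result; how much the generated code differs depends on the compiler and flags, so the only way to know is to measure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Array-and-index style: one counter, two indexed accesses per step.
inline std::size_t widen_indexed(const unsigned char* src, std::size_t n,
                                 std::uint32_t* dst) {
    assert(n == 0 || (src != nullptr && dst != nullptr));
    std::size_t i = 0;
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
    return i;  // number of elements written
}

// Pointer style: advance both pointers and compare against an end pointer.
inline std::size_t widen_pointer(const unsigned char* src, std::size_t n,
                                 std::uint32_t* dst) {
    assert(n == 0 || (src != nullptr && dst != nullptr));
    const unsigned char* end = src + n;
    std::uint32_t* out = dst;
    while (src != end) {
        *out++ = *src++;
    }
    return static_cast<std::size_t>(out - dst);  // number of elements written
}
```

In a real converter the pointer form also makes it natural for each encode or decode step to return the advanced pointer, which avoids passing an index in and out of every call.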