UTF8/UCS conversion benchmark
Point of reference…
UCS (Unicode) to UTF8 conversion, and the reverse, clocks in well above 100 MB/s on current-generation CPUs when efficiently coded in C++. If you are getting something much less (enough to be a problem), then there are questions you should ask. The following run spans 1 to 6 byte UTF8 encodings.
Recoding a UTF8 string to a normalized form is a bit slower, at roughly 84 to 122 MB/s.
preston@athena:~/workspace/json-c$ time Release/json
base 00000010 : 229.9 MB/s UCS to UTF8 conversion
base 00000010 : 253.7 MB/s UTF8 to UCS conversion
base 00000010 : 122.0 MB/s UTF8 recode
base 02000010 : 196.8 MB/s UCS to UTF8 conversion
base 02000010 : 177.1 MB/s UTF8 to UCS conversion
base 02000010 : 100.6 MB/s UTF8 recode
base 04000010 : 173.8 MB/s UCS to UTF8 conversion
base 04000010 : 157.1 MB/s UTF8 to UCS conversion
base 04000010 : 89.2 MB/s UTF8 recode
base 06000010 : 174.0 MB/s UCS to UTF8 conversion
base 06000010 : 158.5 MB/s UTF8 to UCS conversion
base 06000010 : 89.2 MB/s UTF8 recode
base 08000010 : 174.0 MB/s UCS to UTF8 conversion
base 08000010 : 154.5 MB/s UTF8 to UCS conversion
base 08000010 : 89.6 MB/s UTF8 recode
base 0a000010 : 174.1 MB/s UCS to UTF8 conversion
base 0a000010 : 156.5 MB/s UTF8 to UCS conversion
base 0a000010 : 89.8 MB/s UTF8 recode
base 0c000010 : 174.0 MB/s UCS to UTF8 conversion
base 0c000010 : 155.1 MB/s UTF8 to UCS conversion
base 0c000010 : 89.6 MB/s UTF8 recode
base 0e000010 : 174.0 MB/s UCS to UTF8 conversion
base 0e000010 : 158.8 MB/s UTF8 to UCS conversion
base 0e000010 : 87.4 MB/s UTF8 recode
base 10000010 : 170.8 MB/s UCS to UTF8 conversion
base 10000010 : 158.2 MB/s UTF8 to UCS conversion
base 10000010 : 89.7 MB/s UTF8 recode
base 12000010 : 174.0 MB/s UCS to UTF8 conversion
base 12000010 : 158.8 MB/s UTF8 to UCS conversion
base 12000010 : 86.5 MB/s UTF8 recode
base 14000010 : 171.5 MB/s UCS to UTF8 conversion
base 14000010 : 153.9 MB/s UTF8 to UCS conversion
base 14000010 : 87.1 MB/s UTF8 recode
base 16000010 : 172.1 MB/s UCS to UTF8 conversion
base 16000010 : 158.1 MB/s UTF8 to UCS conversion
base 16000010 : 87.5 MB/s UTF8 recode
base 18000010 : 172.1 MB/s UCS to UTF8 conversion
base 18000010 : 158.2 MB/s UTF8 to UCS conversion
base 18000010 : 86.9 MB/s UTF8 recode
base 1a000010 : 171.3 MB/s UCS to UTF8 conversion
base 1a000010 : 158.2 MB/s UTF8 to UCS conversion
base 1a000010 : 86.5 MB/s UTF8 recode
base 1c000010 : 169.5 MB/s UCS to UTF8 conversion
base 1c000010 : 158.5 MB/s UTF8 to UCS conversion
base 1c000010 : 86.5 MB/s UTF8 recode
base 1e000010 : 173.0 MB/s UCS to UTF8 conversion
base 1e000010 : 157.8 MB/s UTF8 to UCS conversion
base 1e000010 : 86.1 MB/s UTF8 recode
base 20000010 : 173.1 MB/s UCS to UTF8 conversion
base 20000010 : 158.4 MB/s UTF8 to UCS conversion
base 20000010 : 87.2 MB/s UTF8 recode
base 22000010 : 173.1 MB/s UCS to UTF8 conversion
base 22000010 : 158.2 MB/s UTF8 to UCS conversion
base 22000010 : 88.2 MB/s UTF8 recode
base 24000010 : 173.3 MB/s UCS to UTF8 conversion
base 24000010 : 158.8 MB/s UTF8 to UCS conversion
base 24000010 : 87.6 MB/s UTF8 recode
base 26000010 : 173.8 MB/s UCS to UTF8 conversion
base 26000010 : 158.1 MB/s UTF8 to UCS conversion
base 26000010 : 87.9 MB/s UTF8 recode
base 28000010 : 172.0 MB/s UCS to UTF8 conversion
base 28000010 : 158.7 MB/s UTF8 to UCS conversion
base 28000010 : 87.6 MB/s UTF8 recode
base 2a000010 : 172.0 MB/s UCS to UTF8 conversion
base 2a000010 : 158.1 MB/s UTF8 to UCS conversion
base 2a000010 : 86.1 MB/s UTF8 recode
base 2c000010 : 172.1 MB/s UCS to UTF8 conversion
base 2c000010 : 158.5 MB/s UTF8 to UCS conversion
base 2c000010 : 86.3 MB/s UTF8 recode
base 2e000010 : 173.8 MB/s UCS to UTF8 conversion
base 2e000010 : 158.2 MB/s UTF8 to UCS conversion
base 2e000010 : 89.3 MB/s UTF8 recode
base 30000010 : 170.0 MB/s UCS to UTF8 conversion
base 30000010 : 158.7 MB/s UTF8 to UCS conversion
base 30000010 : 87.2 MB/s UTF8 recode
base 32000010 : 173.8 MB/s UCS to UTF8 conversion
base 32000010 : 158.4 MB/s UTF8 to UCS conversion
base 32000010 : 87.8 MB/s UTF8 recode
base 34000010 : 173.5 MB/s UCS to UTF8 conversion
base 34000010 : 158.5 MB/s UTF8 to UCS conversion
base 34000010 : 88.0 MB/s UTF8 recode
base 36000010 : 173.1 MB/s UCS to UTF8 conversion
base 36000010 : 158.4 MB/s UTF8 to UCS conversion
base 36000010 : 88.4 MB/s UTF8 recode
base 38000010 : 172.5 MB/s UCS to UTF8 conversion
base 38000010 : 158.5 MB/s UTF8 to UCS conversion
base 38000010 : 88.7 MB/s UTF8 recode
base 3a000010 : 170.5 MB/s UCS to UTF8 conversion
base 3a000010 : 158.4 MB/s UTF8 to UCS conversion
base 3a000010 : 87.4 MB/s UTF8 recode
base 3c000010 : 169.3 MB/s UCS to UTF8 conversion
base 3c000010 : 154.7 MB/s UTF8 to UCS conversion
base 3c000010 : 86.3 MB/s UTF8 recode
base 3e000010 : 172.0 MB/s UCS to UTF8 conversion
base 3e000010 : 156.9 MB/s UTF8 to UCS conversion
base 3e000010 : 87.9 MB/s UTF8 recode
base 40000010 : 171.8 MB/s UCS to UTF8 conversion
base 40000010 : 148.9 MB/s UTF8 to UCS conversion
base 40000010 : 83.7 MB/s UTF8 recode
base 42000010 : 173.1 MB/s UCS to UTF8 conversion
base 42000010 : 149.3 MB/s UTF8 to UCS conversion
base 42000010 : 88.1 MB/s UTF8 recode
base 44000010 : 173.7 MB/s UCS to UTF8 conversion
base 44000010 : 157.7 MB/s UTF8 to UCS conversion
base 44000010 : 87.4 MB/s UTF8 recode
base 46000010 : 172.1 MB/s UCS to UTF8 conversion
base 46000010 : 152.2 MB/s UTF8 to UCS conversion
base 46000010 : 87.3 MB/s UTF8 recode
base 48000010 : 174.0 MB/s UCS to UTF8 conversion
base 48000010 : 142.7 MB/s UTF8 to UCS conversion
base 48000010 : 87.8 MB/s UTF8 recode
base 4a000010 : 173.5 MB/s UCS to UTF8 conversion
base 4a000010 : 150.2 MB/s UTF8 to UCS conversion
base 4a000010 : 87.7 MB/s UTF8 recode
base 4c000010 : 174.0 MB/s UCS to UTF8 conversion
base 4c000010 : 158.4 MB/s UTF8 to UCS conversion
base 4c000010 : 88.4 MB/s UTF8 recode
base 4e000010 : 174.0 MB/s UCS to UTF8 conversion
base 4e000010 : 156.9 MB/s UTF8 to UCS conversion
base 4e000010 : 88.4 MB/s UTF8 recode

real 2m0.751s
user 2m0.540s
sys 0m0.190s
The above run is on an AMD Phenom(tm) II X4 955 Processor running at the stock clock rate.
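For readers who have not seen the 5 and 6 byte forms, here is a minimal sketch of a UCS to UTF8 encoder covering the full 1 to 6 byte range that the run above exercises. It is an illustration only, not the code from the repository: the function name ucs_to_utf8 is made up for this example, and it assumes the input value fits in 31 bits and that the output buffer has room for 6 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Encode one UCS value (assumed < 2^31) as 1 to 6 UTF-8 bytes.
// `out` must have room for 6 bytes. Returns the number of bytes written.
// Hypothetical helper for illustration; not taken from the json-c source.
std::size_t ucs_to_utf8(std::uint32_t c, unsigned char* out) {
    if (c < 0x80) {                       // 7 bits fit in a single byte
        out[0] = static_cast<unsigned char>(c);
        return 1;
    }
    // Byte count from the original 31-bit UTF-8 ranges.
    std::size_t n = c < 0x800     ? 2
                  : c < 0x10000   ? 3
                  : c < 0x200000  ? 4
                  : c < 0x4000000 ? 5
                                  : 6;
    // Fill continuation bytes from the end, 6 payload bits each.
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        c >>= 6;
    }
    // Lead byte: n high bits set, one zero bit, then the remaining payload.
    out[0] = static_cast<unsigned char>((0xFF00u >> n) | c);
    return n;
}
```

With this sketch, a value such as 0x4e000010 (the last base in the run) comes out as the six bytes FD 8E 80 80 80 90.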
Code is in: http://svn.bannister.us/public/json-c/
(A start on an experiment in fastest-possible JSON conversion.)
Note that I intentionally allow proper conversion of “invalid” UTF8 code strings. I completely understand the reason for the disallowed conversions, and I disagree.
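What counts as "invalid" is not spelled out here. One plausible reading, and it is only an assumption, is that the decoder accepts any structurally well-formed sequence, including overlong forms, encoded surrogates, and 5 or 6 byte sequences that a strict decoder would reject. A sketch of that kind of lenient decoder might look like the following; the name utf8_to_ucs_lenient is hypothetical and this is not the repository's code.

```cpp
#include <cstddef>
#include <cstdint>

// Decode one UTF-8 sequence starting at `in`, with `len` bytes available.
// Returns the number of bytes consumed, or 0 if the bytes are structurally
// malformed (bad lead byte, missing or wrong continuation bytes).
// Lenient on purpose: overlong forms, surrogates, and 5/6-byte sequences
// are decoded rather than rejected. Hypothetical helper for illustration.
std::size_t utf8_to_ucs_lenient(const unsigned char* in, std::size_t len,
                                std::uint32_t* out) {
    if (len == 0) return 0;
    unsigned char b = in[0];
    if (b < 0x80) { *out = b; return 1; }

    std::size_t n;      // total sequence length
    std::uint32_t c;    // payload bits taken from the lead byte
    if      ((b & 0xE0) == 0xC0) { n = 2; c = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { n = 3; c = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { n = 4; c = b & 0x07; }
    else if ((b & 0xFC) == 0xF8) { n = 5; c = b & 0x03; }
    else if ((b & 0xFE) == 0xFC) { n = 6; c = b & 0x01; }
    else return 0;      // stray continuation byte, or 0xFE / 0xFF

    if (len < n) return 0;  // truncated input
    for (std::size_t i = 1; i < n; ++i) {
        if ((in[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
        c = (c << 6) | (in[i] & 0x3F);
    }
    *out = c;
    return n;
}
```

A strict decoder would add range checks after the loop (minimum value per length, no surrogates, nothing above 0x10FFFF). Leaving them out is the whole point of the lenient version.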
Update: Converted to use pointer arithmetic, rather than array and index. Was not sure pointer math was still a win on current CPUs and compilers. Got a big boost in throughput, so it is!
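To make the two styles concrete, here is a small sketch of the same inner loop written both ways. It is not the repository's code: the function names are invented, and to keep it short it handles only values below 0x800 (1 or 2 byte encodings) and assumes the destination buffer is large enough.

```cpp
#include <cstddef>
#include <cstdint>

// Array-and-index style. Assumes every value in src is below 0x800.
// Returns the number of bytes written to dst.
std::size_t encode_indexed(const std::uint32_t* src, std::size_t count,
                           unsigned char* dst) {
    std::size_t j = 0;
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t c = src[i];
        if (c < 0x80) {
            dst[j++] = static_cast<unsigned char>(c);
        } else {
            dst[j++] = static_cast<unsigned char>(0xC0 | (c >> 6));
            dst[j++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
    }
    return j;
}

// Pointer-arithmetic style: same logic, walking both buffers with pointers.
std::size_t encode_pointer(const std::uint32_t* src, std::size_t count,
                           unsigned char* dst) {
    const std::uint32_t* end = src + count;
    unsigned char* p = dst;
    while (src != end) {
        std::uint32_t c = *src++;
        if (c < 0x80) {
            *p++ = static_cast<unsigned char>(c);
        } else {
            *p++ = static_cast<unsigned char>(0xC0 | (c >> 6));
            *p++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
    }
    return static_cast<std::size_t>(p - dst);
}
```

Both functions produce identical output. Whether the pointer form is actually faster depends on the compiler and optimization level, so it is worth measuring on your own build rather than assuming.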