Efficient UTF-8 recoding and secure processing

An attempt to make a point…

The use of UTF-8 on the web is common and increasing. Lots of data comes in as UTF-8, and inefficiency in UTF-8 data handling is going to have a pretty pervasive impact.

On the flip side, the creators of UTF-8 did a good job. There is nothing really complicated about the UTF-8 format, and processing is simple.

So I was surprised (or rather shocked) to find in an earlier experiment that Java performed UTF-8 conversion slowly. In fact, I was able to write a faster UTF-8 decoder in Java than the stock decoder. This is just plain wrong. Conversion between encodings is a primitive/simple operation best written in C/C++ and run as native code (and this is the sort of processing where C/C++ is probably always going to be much faster than Java).

There is a problem in that malicious external parties can send oddly-encoded UTF-8 and bypass simple-minded malware detection software. Ordinary ASCII characters can be coded as alternate multi-byte sequences, and simple scanners miss the alternate encoding.

This is a problem. There is a simple solution. One of a set of principles I adopted a long time ago is “convert at the edges”. If you have data coming in from an untrusted source, then you perform conversion and validation at the “edge” where the data is first received.

In the case of UTF-8 coming from an untrusted source, to make all later processing simpler, you must recode to eliminate any alternate encodings. This is quite simple: since shortest-form encoding never uses more bytes than an alternate encoding of the same character, recoded UTF-8 is always the same size or smaller, and so recoding can be done in-place. The prior experiment measured the cost of UTF-8 recoding. Looks like we can drive a 1-Gbit network link at full speed (with efficient code), while recoding the entire contents. Since UTF-8 data usually represents a smaller portion of traffic, and since other processing tends to take the larger part of the load, there is no reason not to perform recoding on any UTF-8 data coming from an untrusted source.

Combine “convert at the edges” with UTF-8 recoding, and we lose the basis for the requirement in RFC 3629 for detection of “illegal” UTF-8 code sequences. In addition we allow all downstream processing to be simpler and more efficient … and we also can be tolerant of imperfect upstream software. (Yes, I am going to invoke Postel, again.)

The basis is good (secure processing with untrusted sources), but the requirement for detection of “illegal” sequences is not necessary and (most definitely!) not optimal.

Example of UTF-8 recoding (from String.cpp).

void UTF8::String::recode() {
    // Iterate until all UTF8 characters are normalized.
    // UTF8 in canonical form can only be smaller, so work in-place.
    // Note: assumes the buffer ends on a complete sequence; a
    // truncated trailing sequence would read past pEOS.
    char* p1 = pBuffer;
    char* p2 = pBuffer;
    char* pEOS = pBuffer + nContent;
    while (p1 < pEOS) {
        // Decode one code point, tolerating any 1..6-byte sequence.
        int c = 255 & *p1++;
        if (c < 0x80) {
            // ASCII passes through unchanged.
            *p2++ = c;
            continue;
        }
        if (c < 0xE0) {             // 2-byte sequence
            c = (31 & c) << 6;
            c |= 63 & *p1++;
        } else if (c < 0xF0) {      // 3-byte sequence
            c = (15 & c) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        } else if (c < 0xF8) {      // 4-byte sequence
            c = (7 & c) << 18;
            c |= (63 & *p1++) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        } else if (c < 0xFC) {      // 5-byte sequence
            c = (3 & c) << 24;
            c |= (63 & *p1++) << 18;
            c |= (63 & *p1++) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        } else {                    // 6-byte sequence
            c = (1 & c) << 30;
            c |= (63 & *p1++) << 24;
            c |= (63 & *p1++) << 18;
            c |= (63 & *p1++) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        }
        // Re-emit the code point in shortest (canonical) form.
        if (c < 0x80) {
            *p2++ = c;
        } else if (c < 0x800) {
            *p2++ = 0xC0 | (c >> 6);
            *p2++ = 0x80 | (63 & c);
        } else if (c < 0x10000) {
            *p2++ = 0xE0 | (c >> 12);
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        } else if (c < 0x200000) {
            *p2++ = 0xF0 | (c >> 18);
            *p2++ = 0x80 | (63 & (c >> 12));
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        } else if (c < 0x4000000) {
            *p2++ = 0xF8 | (c >> 24);
            *p2++ = 0x80 | (63 & (c >> 18));
            *p2++ = 0x80 | (63 & (c >> 12));
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        } else {
            *p2++ = 0xFC | (1 & (c >> 30));
            *p2++ = 0x80 | (63 & (c >> 24));
            *p2++ = 0x80 | (63 & (c >> 18));
            *p2++ = 0x80 | (63 & (c >> 12));
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        }
    }
    nContent = (int) (p2 - pBuffer);
}