Category Archives: Web

WordPress hacked (again)

Around May 13 someone subverted my weblog to serve pharmacy ads. Annoying, but not otherwise a big deal, given regular backups. This hack was more clever than prior incidents. Took me longer to find and remove the problem.

I expect WordPress to be insecure. Looked at the source code early on. Like most PHP applications, the potential attack surface is very large.

Will be a bit before things are entirely in order. (Ick. Using a stock WordPress theme.)

Posted in Personal, Security, Web

Stepped on someone’s gravy train?

About five years back I wrote an article with my colorful speculations about a locally advertised business – Model Quality Introductions. To my slight annoyance, that one article is consistently popular. I was strongly tempted to delete the article, as I was not comfortable with slightly tawdry speculations as the motivation for folk to visit my weblog.

If you type “model quality introductions” as a search string into Google or Bing (but not Yahoo – oddly), you get the site of the original business, followed immediately by my old weblog article.

This week I got an email (which I promised not to publish without consent) claiming to be from someone personally connected to the business, with the claim I was causing harm to their business. The writer was asking that I remove my old weblog article.

On reading the email there were two apparent possibilities. First, the email could be honest, and an attempt to rectify the unfair harm done by my article. Second, the email could be a lie – and an attempt by a marketeer (of some form) to eliminate unfavorable mention.

If the first were true, then I very much regret any harm I may have done. I offered to add the sender’s story, and hopefully offer some balance to readers who might be interested in their business.

If the latter were true, then the email is a lie, and a simple attempt to suppress any highly visible unfavorable information.

I should add that one of my very first jobs was with a guy who considered himself a super-salesman (Hello, Eric!), and would regularly offer bullshit persuasion on one topic or another. As a result, I tend to detect and reject very early the pattern of expression offered by sales/marketing critters. I should also add that while a teenager – and as a personal interest – I read rather a lot about Semantics and Psychology.

You can often tell a lot about a writer’s background from phrasing and how they express ideas. The initial email sounded rather more like the expressions of a marketeer, and rather less like the claimed source. Still … this was an uncertain judgement. In my response, I offered to encourage and protect the sender – if honest (and meant exactly what I wrote). I was – very carefully – honest in everything I said. At the same time I dropped in cues that I expected a marketeer would try to interpret as in their interest.

The subsequent responses fit exactly with what I would expect if the writer were a marketeer, and a third-rate mind. With each exchange, my conviction increased that the writer was a marketeer (and a liar).

Still … I could be wrong. The writer could be exactly as claimed. On that chance, I will honor my promises. On the other hand, my current bet is 10:1 that the writer is a marketeer.

As the exchange aroused my curiosity, I went searching, and found:

  • The online dating business must be hugely profitable, as the number of consultants and related businesses indicates.
  • The dating-service-with-rich-guys model must be very profitable, as “Model Quality Introductions” expanded to more locations – and there are many imitators.

The odds of a hired marketeer – outside the content of the exchange – seem rather high.

By analogy, I was playing chess, and the marketeer was playing checkers.
Bit amusing, this. :)

Posted in Web

Using GMail for mailto: links in Ubuntu

Create the file $HOME/bin/mailto with the contents:

#!/bin/sh
gnome-open "https://mail.google.com/mail?extsrc=mailto&url=$*"

Make the file executable.

On Ubuntu Linux (using the Gnome desktop), go to:

System > Preferences > Preferred Applications

Under Internet / Mail Reader select “Custom” and enter the command:

/home/preston/bin/mailto %s

(Replace “/home/preston” with your $HOME.)

This should open GMail in your default web browser, composing a new message, with the recipient set.

Posted in Web

Multiplexed FastCGI connections?

Does anyone use FastCGI with FCGI_MPXS_CONNS set to “1” (for multiplexed connections)?

Most FastCGI backends seem to be written for non-multiplexed connections. (Much simpler, so understandable.) The IIS FastCGI connector apparently does not support multiplexed connections.

I am writing a FastCGI backend that allows for multiplexed connections. That would be a waste of time if multiplexing is not supported by any frontend, or if existing frontends are buggy.
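
For readers unfamiliar with what multiplexing asks of a backend, here is a minimal sketch. It assumes the 8-byte record header from the FastCGI specification (version, type, two-byte request ID, two-byte content length, padding length, reserved). The names demuxRecords, appendRecord and demuxExample are made up for illustration, and this is nowhere near a complete backend.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// FCGI_STDIN record type, as numbered in the FastCGI specification.
const std::uint8_t FCGI_STDIN = 5;

// Hypothetical helper: walk a buffer of FastCGI records and collect the
// FCGI_STDIN payload of each request, keyed by request ID. Returns the
// number of bytes consumed; a trailing partial record is left for later.
std::size_t demuxRecords(const std::vector<std::uint8_t>& in,
                         std::map<std::uint16_t, std::string>& stdinByRequest) {
    std::size_t pos = 0;
    while (in.size() - pos >= 8) {                    // 8-byte record header
        const std::uint8_t* h = &in[pos];
        std::uint8_t type = h[1];
        std::uint16_t requestId = static_cast<std::uint16_t>((h[2] << 8) | h[3]);
        std::size_t contentLength = static_cast<std::size_t>((h[4] << 8) | h[5]);
        std::size_t total = 8 + contentLength + h[6]; // h[6] is paddingLength
        if (in.size() - pos < total) break;           // record not complete yet
        if (type == FCGI_STDIN) {
            stdinByRequest[requestId].append(
                reinterpret_cast<const char*>(h + 8), contentLength);
        }
        pos += total;
    }
    return pos;
}

// Hypothetical helper for the example: append one record to a wire buffer.
void appendRecord(std::vector<std::uint8_t>& wire, std::uint8_t type,
                  std::uint16_t requestId, const std::string& body) {
    const std::uint8_t header[8] = {
        1, type,                                      // version 1, record type
        static_cast<std::uint8_t>(requestId >> 8),
        static_cast<std::uint8_t>(requestId & 0xFF),
        static_cast<std::uint8_t>(body.size() >> 8),
        static_cast<std::uint8_t>(body.size() & 0xFF),
        0, 0 };                                       // no padding, reserved
    wire.insert(wire.end(), header, header + 8);
    wire.insert(wire.end(), body.begin(), body.end());
}

// Two requests interleaved on one connection come back out separated.
void demuxExample() {
    std::vector<std::uint8_t> wire;
    appendRecord(wire, FCGI_STDIN, 1, "ab");
    appendRecord(wire, FCGI_STDIN, 2, "XY");
    appendRecord(wire, FCGI_STDIN, 1, "cd");
    std::map<std::uint16_t, std::string> bodies;
    std::size_t used = demuxRecords(wire, bodies);
    assert(used == wire.size());
    assert(bodies[1] == "abcd");
    assert(bodies[2] == "XY");
}
```

The per-request map is the whole difference: a non-multiplexed backend can assume every record on a connection belongs to the same request, and skip the bookkeeping.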

(Not really expecting a response, but have to state the question.)

Posted in Software, Web

Giving up HTML@W3C

Got the “status as Invited Expert in HTML Working Group” email. This I will let expire. Spent my time tilting at windmills, and do not see any point in continuing.

The HTML Working Group at W3C is … far too much noise. The HTML5 “standard” is going to be a bloated monster, and there is no chance I can change that. Time to stop the pretense of trying.

Not that all the work is bad, or that there is any shortage of well-intentioned folk. What the group lacks is any sense of minimalism, and enough strong voices able to say “no”. What we will get is going to be even harder to digest than HTML4. This is sad. Future developers are going to have an even harder time, for no good reason.

The wildcard here is that the mainstream browser implementations may not follow all the half-thought ideas thrown in by the HTML5 Working Group. No idea how this will work out.

At the core HTML is pretty damn simple – or could be. HTML4 got stuffed with a bunch of half-thought notions, most of which have since proved of no value, and were ignored by developers. Ideally we would learn from experience, omit the fuzzy disused bits, and trim HTML down to the useful core. There are well-known (though not universally known) means to achieve this aim. This is not going to happen.

I cannot keep up with the herd of Energizer-bunnies eager to make their mark, and with too-limited experience.
Time to stop pretending.

Posted in Software, Web, html@w3c

Efficient UTF-8 recoding and secure processing

An attempt to make a point…

The use of UTF-8 on the web is common and increasing. Lots of data comes in as UTF-8, and inefficiency in UTF-8 data handling is going to have pretty pervasive impact.

On the flip side, the creators of UTF-8 did a good job. There is nothing really complicated about the UTF-8 format, and processing is simple.

So I was surprised (or rather shocked) to find in an earlier experiment that Java performed UTF-8 conversion slowly. In fact, I was able to write a faster UTF-8 decoder in Java than the stock decoder. This is just plain wrong. Conversion between encodings is a primitive/simple operation best written in C/C++ and run as native code (and this is the sort of processing where C/C++ is probably always going to be much faster than Java).

There is a problem in that malicious external parties can send oddly-encoded UTF-8, and bypass simple-minded malware detection software. Ordinary ASCII characters can be coded as alternate multi-byte sequences, and simple scanners miss the alternate encoding.
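
A small illustration of the problem, as a sketch only. The bytes 0xC0 0xAF are a well-known two-byte alternate ("overlong") form of “/” (U+002F). The function names here are made up for the example, and the decoder deliberately omits the overlong check a strict decoder would apply.

```cpp
#include <cassert>
#include <string>

// Simple-minded scanner: looks only for the raw ASCII byte '/'.
bool containsSlashByte(const std::string& s) {
    return s.find('/') != std::string::npos;
}

// Tolerant decode of a two-byte sequence (lead byte 0xC0..0xDF).
int decodeTwoByteTolerant(unsigned char lead, unsigned char cont) {
    return ((lead & 0x1F) << 6) | (cont & 0x3F);
}

void overlongExample() {
    const std::string sneaky = "..\xC0\xAF..";
    assert(!containsSlashByte(sneaky));               // the scanner sees no '/'
    assert(decodeTwoByteTolerant(0xC0, 0xAF) == '/'); // a tolerant decoder does
}
```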

This is a problem. There is a simple solution. One of a set of principles I adopted a long time ago is “convert at the edges”. If you have data coming in from an untrusted source, then you perform conversion and validation at the “edge” where the data is first received.

In the case of UTF-8 coming from an untrusted source, to make all later processing simpler, you must recode to eliminate any alternate encodings. This is quite simple, as recoded UTF-8 will always be the same size or smaller, and so can be done in-place. The prior experiment measured the cost of UTF-8 recoding at roughly 84 to 122 MB/s on one core. A 1-Gbit network link carries at most about 125 MB/s, so it looks like we can drive such a link at or close to full speed (with efficient code), while recoding the entire contents. Since UTF-8 data usually represents a smaller portion of traffic, and since other processing tends to take the larger part of the load, there is no reason to not perform recoding on any UTF-8 data coming from an untrusted source.

Combine “convert at the edges” with UTF-8 recoding, and we lose the basis for the requirement in RFC 3629 for detection of “illegal” UTF-8 code sequences. In addition we allow all downstream processing to be simpler and more efficient … and we also can be tolerant of imperfect upstream software. (Yes, I am going to invoke Postel, again.)

The basis is good (secure processing with untrusted sources), but the requirement for detection of “illegal” sequences is not necessary and (most definitely!) not optimal.

Example of UTF-8 recoding (from String.cpp).

void UTF8::String::recode() {
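    // Iterate until all UTF8 characters are normalized.
    // UTF8 in canonical form can only be smaller, so work in-place.
    // Note: continuation bytes are read without a bounds check, so the
    // buffer must not end in a truncated multi-byte sequence.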
    char* p1 = pBuffer;
    char* p2 = pBuffer;
    char* pEOS = pBuffer + nContent;
    while (p1 < pEOS) {
        int c = 255 & *p1++;
        if (c < 0x80) {
            *p2++ = c;
            continue;
        }
        if (c < 0xE0) {
            c = (31 & c) << 6;
            c |= 63 & *p1++;
        } else if (c < 0xF0) {
            c = (15 & c) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        } else if (c < 0xF8) {
            c = (7 & c) << 18;
            c |= (63 & *p1++) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        } else if (c < 0xFC) {
            c = (3 & c) << 24;
            c |= (63 & *p1++) << 18;
            c |= (63 & *p1++) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        } else {
            c = (1 & c) << 30;
            c |= (63 & *p1++) << 24;
            c |= (63 & *p1++) << 18;
            c |= (63 & *p1++) << 12;
            c |= (63 & *p1++) << 6;
            c |= 63 & *p1++;
        }
        if (c < 0x80) {
            *p2++ = c;
        } else if (c < 0x800) {
            *p2++ = 0xC0 | (c >> 6);
            *p2++ = 0x80 | (63 & c);
        } else if (c < 0x10000) {
            *p2++ = 0xE0 | (c >> 12);
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        } else if (c < 0x200000) {
            *p2++ = 0xF0 | (c >> 18);
            *p2++ = 0x80 | (63 & (c >> 12));
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        } else if (c < 0x4000000) {
            *p2++ = 0xF8 | (c >> 24);
            *p2++ = 0x80 | (63 & (c >> 18));
            *p2++ = 0x80 | (63 & (c >> 12));
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        } else {
            *p2++ = 0xFC | (1 & (c >> 30));
            *p2++ = 0x80 | (63 & (c >> 24));
            *p2++ = 0x80 | (63 & (c >> 18));
            *p2++ = 0x80 | (63 & (c >> 12));
            *p2++ = 0x80 | (63 & (c >> 6));
            *p2++ = 0x80 | (63 & c);
        }
    }
    nContent = (int) (p2 - pBuffer);
}
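
For anyone who wants to experiment without the rest of String.cpp, here is a standalone variant. It is a sketch and not the code from the repository. It works on a std::string, folds the per-length cases into loops, and adds a bounds check: a truncated sequence at the end of the buffer is dropped, which is my assumption about reasonable behavior and not something the original does. The names continuationCount, recodeInPlace and recodeExample are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Number of continuation bytes implied by a lead byte (1 to 6 byte forms).
static int continuationCount(unsigned lead) {
    if (lead < 0x80) return 0;
    if (lead < 0xE0) return 1;
    if (lead < 0xF0) return 2;
    if (lead < 0xF8) return 3;
    if (lead < 0xFC) return 4;
    return 5;
}

// Recode to the shortest form, in place. The output never gets ahead of
// the input, because the shortest form is never longer than the original.
void recodeInPlace(std::string& s) {
    std::size_t in = 0, out = 0, n = s.size();
    while (in < n) {
        unsigned lead = static_cast<unsigned char>(s[in]);
        int k = continuationCount(lead);
        if (n - in - 1 < static_cast<std::size_t>(k)) break; // truncated tail
        ++in;
        // Payload bits in the lead byte: all 7 for ASCII, else 6 - k.
        std::uint32_t c = (k == 0) ? lead : (lead & (0x3Fu >> k));
        for (int i = 0; i < k; ++i)
            c = (c << 6) | (static_cast<unsigned char>(s[in++]) & 0x3Fu);
        // Continuation bytes needed for the shortest encoding of c.
        int m = c < 0x80 ? 0 : c < 0x800 ? 1 : c < 0x10000 ? 2
              : c < 0x200000 ? 3 : c < 0x4000000 ? 4 : 5;
        if (m == 0) {
            s[out++] = static_cast<char>(c);
            continue;
        }
        // Lead markers 0xC0, 0xE0, 0xF0, 0xF8, 0xFC for m = 1..5.
        unsigned marker = (0xFF00u >> (m + 1)) & 0xFFu;
        s[out++] = static_cast<char>(marker | (c >> (6 * m)));
        for (int i = m - 1; i >= 0; --i)
            s[out++] = static_cast<char>(0x80u | ((c >> (6 * i)) & 0x3Fu));
    }
    s.resize(out);
}

void recodeExample() {
    std::string a = "A\xC0\xAFZ";      // two-byte alternate form of '/'
    recodeInPlace(a);
    assert(a == "A/Z");
    std::string b = "\xE0\x80\xAF";    // three-byte alternate form of '/'
    recodeInPlace(b);
    assert(b == "/");
    std::string c = "\xC3\xA9";        // already shortest form, unchanged
    recodeInPlace(c);
    assert(c == "\xC3\xA9");
}
```

After recoding, a downstream scanner that only looks at raw bytes sees the “/” it would otherwise have missed.
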
Posted in Software, Web

UTF8/UCS conversion benchmark

Point of reference…

UCS (Unicode) to UTF8 conversion, and the reverse, when efficiently coded in C++ clocks in well above 100MB/s on current generation CPUs. If you are getting something much less – enough to be a problem – then there are questions you should ask. The following run spans 1 to 6 byte UTF8 encodings.

Recoding a UTF8 string to a normalized form is a bit slower, at roughly 84 to 122 MB/s.

preston@athena:~/workspace/json-c$ time Release/json
base 00000010 :  229.9 MB/s UCS to UTF8 conversion
base 00000010 :  253.7 MB/s UTF8 to UCS conversion
base 00000010 :  122.0 MB/s UTF8 recode
base 02000010 :  196.8 MB/s UCS to UTF8 conversion
base 02000010 :  177.1 MB/s UTF8 to UCS conversion
base 02000010 :  100.6 MB/s UTF8 recode
base 04000010 :  173.8 MB/s UCS to UTF8 conversion
base 04000010 :  157.1 MB/s UTF8 to UCS conversion
base 04000010 :   89.2 MB/s UTF8 recode
base 06000010 :  174.0 MB/s UCS to UTF8 conversion
base 06000010 :  158.5 MB/s UTF8 to UCS conversion
base 06000010 :   89.2 MB/s UTF8 recode
base 08000010 :  174.0 MB/s UCS to UTF8 conversion
base 08000010 :  154.5 MB/s UTF8 to UCS conversion
base 08000010 :   89.6 MB/s UTF8 recode
base 0a000010 :  174.1 MB/s UCS to UTF8 conversion
base 0a000010 :  156.5 MB/s UTF8 to UCS conversion
base 0a000010 :   89.8 MB/s UTF8 recode
base 0c000010 :  174.0 MB/s UCS to UTF8 conversion
base 0c000010 :  155.1 MB/s UTF8 to UCS conversion
base 0c000010 :   89.6 MB/s UTF8 recode
base 0e000010 :  174.0 MB/s UCS to UTF8 conversion
base 0e000010 :  158.8 MB/s UTF8 to UCS conversion
base 0e000010 :   87.4 MB/s UTF8 recode
base 10000010 :  170.8 MB/s UCS to UTF8 conversion
base 10000010 :  158.2 MB/s UTF8 to UCS conversion
base 10000010 :   89.7 MB/s UTF8 recode
base 12000010 :  174.0 MB/s UCS to UTF8 conversion
base 12000010 :  158.8 MB/s UTF8 to UCS conversion
base 12000010 :   86.5 MB/s UTF8 recode
base 14000010 :  171.5 MB/s UCS to UTF8 conversion
base 14000010 :  153.9 MB/s UTF8 to UCS conversion
base 14000010 :   87.1 MB/s UTF8 recode
base 16000010 :  172.1 MB/s UCS to UTF8 conversion
base 16000010 :  158.1 MB/s UTF8 to UCS conversion
base 16000010 :   87.5 MB/s UTF8 recode
base 18000010 :  172.1 MB/s UCS to UTF8 conversion
base 18000010 :  158.2 MB/s UTF8 to UCS conversion
base 18000010 :   86.9 MB/s UTF8 recode
base 1a000010 :  171.3 MB/s UCS to UTF8 conversion
base 1a000010 :  158.2 MB/s UTF8 to UCS conversion
base 1a000010 :   86.5 MB/s UTF8 recode
base 1c000010 :  169.5 MB/s UCS to UTF8 conversion
base 1c000010 :  158.5 MB/s UTF8 to UCS conversion
base 1c000010 :   86.5 MB/s UTF8 recode
base 1e000010 :  173.0 MB/s UCS to UTF8 conversion
base 1e000010 :  157.8 MB/s UTF8 to UCS conversion
base 1e000010 :   86.1 MB/s UTF8 recode
base 20000010 :  173.1 MB/s UCS to UTF8 conversion
base 20000010 :  158.4 MB/s UTF8 to UCS conversion
base 20000010 :   87.2 MB/s UTF8 recode
base 22000010 :  173.1 MB/s UCS to UTF8 conversion
base 22000010 :  158.2 MB/s UTF8 to UCS conversion
base 22000010 :   88.2 MB/s UTF8 recode
base 24000010 :  173.3 MB/s UCS to UTF8 conversion
base 24000010 :  158.8 MB/s UTF8 to UCS conversion
base 24000010 :   87.6 MB/s UTF8 recode
base 26000010 :  173.8 MB/s UCS to UTF8 conversion
base 26000010 :  158.1 MB/s UTF8 to UCS conversion
base 26000010 :   87.9 MB/s UTF8 recode
base 28000010 :  172.0 MB/s UCS to UTF8 conversion
base 28000010 :  158.7 MB/s UTF8 to UCS conversion
base 28000010 :   87.6 MB/s UTF8 recode
base 2a000010 :  172.0 MB/s UCS to UTF8 conversion
base 2a000010 :  158.1 MB/s UTF8 to UCS conversion
base 2a000010 :   86.1 MB/s UTF8 recode
base 2c000010 :  172.1 MB/s UCS to UTF8 conversion
base 2c000010 :  158.5 MB/s UTF8 to UCS conversion
base 2c000010 :   86.3 MB/s UTF8 recode
base 2e000010 :  173.8 MB/s UCS to UTF8 conversion
base 2e000010 :  158.2 MB/s UTF8 to UCS conversion
base 2e000010 :   89.3 MB/s UTF8 recode
base 30000010 :  170.0 MB/s UCS to UTF8 conversion
base 30000010 :  158.7 MB/s UTF8 to UCS conversion
base 30000010 :   87.2 MB/s UTF8 recode
base 32000010 :  173.8 MB/s UCS to UTF8 conversion
base 32000010 :  158.4 MB/s UTF8 to UCS conversion
base 32000010 :   87.8 MB/s UTF8 recode
base 34000010 :  173.5 MB/s UCS to UTF8 conversion
base 34000010 :  158.5 MB/s UTF8 to UCS conversion
base 34000010 :   88.0 MB/s UTF8 recode
base 36000010 :  173.1 MB/s UCS to UTF8 conversion
base 36000010 :  158.4 MB/s UTF8 to UCS conversion
base 36000010 :   88.4 MB/s UTF8 recode
base 38000010 :  172.5 MB/s UCS to UTF8 conversion
base 38000010 :  158.5 MB/s UTF8 to UCS conversion
base 38000010 :   88.7 MB/s UTF8 recode
base 3a000010 :  170.5 MB/s UCS to UTF8 conversion
base 3a000010 :  158.4 MB/s UTF8 to UCS conversion
base 3a000010 :   87.4 MB/s UTF8 recode
base 3c000010 :  169.3 MB/s UCS to UTF8 conversion
base 3c000010 :  154.7 MB/s UTF8 to UCS conversion
base 3c000010 :   86.3 MB/s UTF8 recode
base 3e000010 :  172.0 MB/s UCS to UTF8 conversion
base 3e000010 :  156.9 MB/s UTF8 to UCS conversion
base 3e000010 :   87.9 MB/s UTF8 recode
base 40000010 :  171.8 MB/s UCS to UTF8 conversion
base 40000010 :  148.9 MB/s UTF8 to UCS conversion
base 40000010 :   83.7 MB/s UTF8 recode
base 42000010 :  173.1 MB/s UCS to UTF8 conversion
base 42000010 :  149.3 MB/s UTF8 to UCS conversion
base 42000010 :   88.1 MB/s UTF8 recode
base 44000010 :  173.7 MB/s UCS to UTF8 conversion
base 44000010 :  157.7 MB/s UTF8 to UCS conversion
base 44000010 :   87.4 MB/s UTF8 recode
base 46000010 :  172.1 MB/s UCS to UTF8 conversion
base 46000010 :  152.2 MB/s UTF8 to UCS conversion
base 46000010 :   87.3 MB/s UTF8 recode
base 48000010 :  174.0 MB/s UCS to UTF8 conversion
base 48000010 :  142.7 MB/s UTF8 to UCS conversion
base 48000010 :   87.8 MB/s UTF8 recode
base 4a000010 :  173.5 MB/s UCS to UTF8 conversion
base 4a000010 :  150.2 MB/s UTF8 to UCS conversion
base 4a000010 :   87.7 MB/s UTF8 recode
base 4c000010 :  174.0 MB/s UCS to UTF8 conversion
base 4c000010 :  158.4 MB/s UTF8 to UCS conversion
base 4c000010 :   88.4 MB/s UTF8 recode
base 4e000010 :  174.0 MB/s UCS to UTF8 conversion
base 4e000010 :  156.9 MB/s UTF8 to UCS conversion
base 4e000010 :   88.4 MB/s UTF8 recode

real	2m0.751s
user	2m0.540s
sys	0m0.190s

The above run is on an AMD Phenom(tm) II X4 955 Processor running at the stock clock rate.

Code is in: http://svn.bannister.us/public/json-c/
(A start on an experiment in fastest-possible JSON conversion.)
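
If you want to take comparable measurements of your own code, a throughput figure can be computed along these lines. This is a sketch, not the harness from the repository above. It assumes MB means 10^6 bytes, and the names throughputMBps, touchAll and throughputExample are made up for illustration. The stand-in workload just reads every byte; swap in the conversion you care about.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>

// Run work() `passes` times and report throughput in MB/s (10^6 bytes/s),
// given the number of bytes each pass processes. Returns 0 if the elapsed
// time was too small for the clock to measure.
template <class Work>
double throughputMBps(Work work, std::size_t bytesPerPass, int passes) {
    assert(passes > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; ++i) work();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    if (elapsed.count() <= 0.0) return 0.0;
    return static_cast<double>(bytesPerPass) * passes / 1e6 / elapsed.count();
}

// Stand-in workload: read every byte of a buffer. Storing the result in a
// volatile keeps the compiler from optimizing the loop away.
volatile unsigned sink = 0;
void touchAll(const std::string& buf) {
    unsigned sum = 0;
    for (unsigned char b : buf) sum += b;
    sink = sum;
}

double throughputExample() {
    std::string buf(1 << 20, 'x');   // 1 MiB of input
    return throughputMBps([&buf] { touchAll(buf); }, buf.size(), 50);
}
```

Use enough passes that the run lasts well over the clock resolution, and keep the buffer setup outside the timed region.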

Note that I intentionally allow proper conversion of “invalid” UTF8 code strings. I completely understand the reason for the disallowed conversions, and I disagree.

Update: Converted to use pointer arithmetic, rather than array and index. Was not sure pointer math was still a win on current CPUs and compilers. Got a big boost in throughput, so it is!

Posted in Software, Web

… status as Invited Expert in HTML Working Group

At one time I had hoped there was a small chance I might be able to nudge the HTML working group in a constructive direction. Over time, what I found is that there are a small number of individuals who are able to invest an inordinate amount of time in this same working group, and I cannot possibly invest the time to construct thoughtful responses to the flood of ill-considered notions.

There is almost no chance I can move the working group in a useful direction. Time to disconnect.

This is all rather discouraging. The HTML working group will proceed. Some of the work is worthwhile. Much (measured by volume of email list traffic) is not. Whatever mix makes it into the proposed “standard” is sure to be a mess. Not sure how to change any of this.

My status as an “Invited Expert” is up for renewal. With extreme reluctance … my judgement is that I cannot make a useful contribution, and should disassociate from the HTML working group. Of course, they will continue on the present course, in my absence. My withdrawal makes no difference of significance. There is a fair chance the body of work from this working group will be adopted, imperfect as it is. The existing body of work is … badly skewed by an imperfect process.

Nothing meaningful I can do. The result will be a mess, and will create a mess for years after. Time to disengage.

Funny bit – I do not see a way to force a disconnect.

Posted in Web, html@w3c