Server-side parsing of HTML to DOM

2006-11-11

Ran into an unexpected problem.

Had this bit of inspiration for what I thought would be an optimally performant wiki/weblog, and started putting together a prototype. The client-side JavaScript went together pretty easily. The server side is presenting more of a problem.

The basic notion is I want to send an HTML fragment to the server. The server-side code would then:

1. Read and parse an HTML file from disk into a DOM tree.
2. Replace a DIV identified by an ID attribute with the HTML fragment received from the client.
3. Save the edited HTML into a file on disk.
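
To make the three steps concrete, here is a rough sketch in Python (not one of the languages I was targeting, just handy for illustration). It is not a true DOM round trip: it swaps in a streaming approach using the standard library's html.parser, copying the page through unchanged and substituting the fragment for the target DIV. The names DivReplacer, replace_div, and update_page are made up for this example, and it assumes the file is UTF-8 and the DIV tags are properly nested.

```python
from html.parser import HTMLParser


class DivReplacer(HTMLParser):
    """Copy HTML through unchanged, swapping one DIV (found by id) for a fragment."""

    def __init__(self, target_id, fragment):
        # convert_charrefs=False keeps entities like &amp; as written.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.fragment = fragment
        self.out = []
        self.depth = 0          # > 0 while inside the DIV being replaced
        self.replaced = False

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested DIV inside the target: track it, drop it
            return
        if tag == "div" and not self.replaced and dict(attrs).get("id") == self.target_id:
            self.out.append(self.fragment)  # fragment stands in for the whole DIV
            self.depth = 1
            self.replaced = True
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
            return
        self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.depth:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def replace_div(html, target_id, fragment):
    """Return html with the DIV whose id is target_id replaced by fragment."""
    parser = DivReplacer(target_id, fragment)
    parser.feed(html)
    parser.close()
    if not parser.replaced:
        raise LookupError(f"no DIV with id {target_id!r}")
    return "".join(parser.out)


def update_page(path, target_id, fragment):
    """Steps 1-3: read the file, replace the DIV, save the result."""
    with open(path, encoding="utf-8") as f:
        html = f.read()
    new_html = replace_div(html, target_id, fragment)
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_html)
```

Since the whole DIV is replaced, the fragment from the client would need to carry its own DIV with the same ID if the block is to stay editable on the next save.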

Pretty simple, right?

The application-specific code on the server is very small. I was hoping to keep the server-side code simple so multiple implementations (PHP, Perl, Java) would be feasible. That would allow the widest possible usage.

Granted, a parser for sloppy HTML is not trivial. On the other hand, you only need one good open-source HTML parser to cover all the languages that are implemented in C/C++. I know there are a couple for Java. By now I had hoped that HTML -> DOM (and back) would be relatively common.

Guess I was wrong.

Pulled up the PHP documentation and found the DOM functions. Hmmm, PHP4 apparently only supported XML, which is not good enough. I want a parser able to tolerate less-than-perfect HTML. PHP5 apparently has a more tolerant parser, though I'd hoped not to have to require the very latest version. Oh well ... wrote code to slurp the HTML fragment from the client (OK), slurp the HTML file off disk (OK), find the DIV to replace with getElementById() ... nope, it does not find the DIV. The file is HTML 4.01 Strict (been through the validator), and getElementById() finds the same DIV from browser JavaScript. Looks like the newish PHP DOM functions may be a bit buggy still.

Don't see an alternative, so PHP may be a non-starter.

My web host allows use of PHP, Perl, Python, and Ruby ... so off to do some reading.

Update: Found my way around the problem with the PHP DOM functions. A bit of a hack, but not too bad. What I really want is an assignable equivalent to innerHTML, as this would be simpler and more efficient. Oh well, sometimes you work with what you have.