Server-side parsing of HTML to DOM
Ran into an unexpected problem.
Had this bit of inspiration for what I thought would be an optimally performant wiki/weblog, and started putting together a prototype. The client-side JavaScript went together pretty easily. The server side is presenting more of a problem.
The basic notion is I want to send an HTML fragment to the server. The server-side code would then:
- Read and parse an HTML file from disk into a DOM tree.
- Replace a DIV identified by an ID attribute with the HTML fragment received from the client.
- Save the edited HTML into a file on disk.
Pretty simple, right?
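To make the three steps concrete, here is a minimal sketch in Python, using only the standard library's html.parser. It is not the PHP code from this post, and it swaps in a different technique: rather than building a DOM tree, it uses the tolerant tokenizer to locate the target DIV and then splices the fragment into the original text. It also assumes the goal is to replace the contents of the DIV (innerHTML-style) and keep the DIV tags themselves. The names DivLocator, replace_div_contents, and update_page are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class DivLocator(HTMLParser):
    """Find where the contents of <div id=target_id> start and end."""

    def __init__(self, html, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0           # <div> nesting depth, counted from the target
        self.inner_start = None  # index just past the target's start tag
        self.inner_end = None    # index of the target's matching </div>
        # Index at which each line begins, so getpos() can be turned
        # into an absolute position in the string.
        self.line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.feed(html)
        self.close()

    def _offset(self):
        line, col = self.getpos()  # position of the tag being handled
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.inner_start is None:
            if dict(attrs).get("id") == self.target_id:
                self.inner_start = self._offset() + len(self.get_starttag_text())
                self.depth = 1
        elif self.inner_end is None:
            self.depth += 1      # a DIV nested inside the target

    def handle_endtag(self, tag):
        if tag != "div" or self.inner_start is None or self.inner_end is not None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.inner_end = self._offset()


def replace_div_contents(html, target_id, fragment):
    """Return html with the contents of the target DIV replaced by fragment."""
    loc = DivLocator(html, target_id)
    if loc.inner_start is None or loc.inner_end is None:
        raise ValueError(f"no closed <div id={target_id!r}> found")
    return html[:loc.inner_start] + fragment + html[loc.inner_end:]


def update_page(path, target_id, fragment):
    """Read the file, replace the DIV contents, and save it back."""
    with open(path, encoding="utf-8") as f:
        html = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(replace_div_contents(html, target_id, fragment))
```

Calling update_page("page.html", "content", fragment) would cover all three steps, where the file name and ID are placeholders. Because everything outside the target DIV is copied through untouched, sloppy markup elsewhere in the file survives unchanged, but the fragment itself is not validated at all.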
The application-specific code on the server is very small. I was hoping to keep the server-side code simple so multiple implementations (PHP, Perl, Java) would be feasible. This would allow the widest possible usage.
Granted, a parser for sloppy HTML is not trivial. On the other hand, you only need one good open source HTML parser for all the languages implemented in C/C++. I know there are a couple for Java. By now I had hoped that HTML -> DOM (and back) would be relatively common.
Guess I was wrong.
Pulled up the PHP documentation and found the DOM functions. Hmmm, PHP4 apparently only supported XML, which is not good enough. I want a parser able to tolerate less-than-perfect HTML. PHP5 apparently has a more tolerant parser, though I’d hoped not to have to require the very latest version. Oh well … wrote code to slurp the HTML fragment from the client (OK), slurp the HTML file off disk (OK), find the DIV to replace with getElementById() … nope. The file is HTML 4.01 Strict (been through the validator), and getElementById() works on it from browser JavaScript. Looks like the newish PHP DOM functions may be a bit buggy still.
Don’t see an alternative, so PHP may be a non-starter.
My web host allows use of PHP, Perl, Python, and Ruby … so off to do some reading.
Update: Found my way around the problem with the PHP DOM functions. A bit of a hack, but not too bad. What I really want is an assignable equivalent to innerHTML, as that would be simpler and more efficient. Oh well, sometimes you work with what you have.